Do your work for this exercise in a file named model.

Take the work we did in the lessons further:
- What other types of models (i.e. different classification algorithms) could you use?
- How do the models compare when trained on term frequency data alone, instead of TF-IDF values?

In [1]:
import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import TfidfVectorizer 

import acquire
import prepare

#### Acquire

In [2]:
df_fresh = pd.DataFrame(acquire.get_news_articles())
df_fresh

Unnamed: 0,title,body,category
0,My faith in humanity is restored: Musk after w...,After a US jury found that Elon Musk did not d...,business
1,I'll take it on the chin: Cave explorer after ...,"British cave explorer Vernon Unsworth, who los...",business
2,FIR filed against Club Factory in Lucknow for ...,A Lucknow-based customer has filed an FIR agai...,business
3,Onion prices surge up to ₹200 per kg in Bengaluru,Price of onion has shot up to ₹200 per kg in B...,business
4,UP Cabinet approves Zurich Airport Int'l as de...,The Uttar Pradesh Cabinet on Monday approved Z...,business
...,...,...,...
94,"Salman, Katrina perform at B'desh Premier Leag...",Actors Salman Khan and Katrina Kaif performed ...,entertainment
95,It's deeply disappointing: Dia Mirza on Hydera...,After the four men accused of gangraping and m...,entertainment
96,Sushmita Sen announces her return to films aft...,Sushmita Sen on Monday took to Instagram to an...,entertainment
97,"I'm not a political person, don't believe in '...",Singer Usha Uthup has said she's working as a ...,entertainment


In [3]:
df_fresh.body[0]

'After a US jury found that Elon Musk did not defame British cave explorer Vernon Unsworth by calling him a "pedo guy" on Twitter, the Tesla CEO said, "My faith in humanity is restored". The 48-year-old billionaire argued that he did not intend to call Unsworth a paedophile, but instead was using "pedo guy" to mean "creepy old guy".'

#### Prepare

In [4]:
prepare.prep_article(df_fresh)

Unnamed: 0,title,category,original,stemmed,lemmatized,clean
0,My faith in humanity is restored: Musk after w...,business,After a US jury found that Elon Musk did not d...,after a us juri found that elon musk did not d...,after a u jury found that elon musk did not de...,us jury found elon musk defame british cave ex...
1,I'll take it on the chin: Cave explorer after ...,business,"British cave explorer Vernon Unsworth, who los...",british cave explor vernon unsworth who lost t...,british cave explorer vernon unsworth who lost...,british cave explorer vernon unsworth lost def...
2,FIR filed against Club Factory in Lucknow for ...,business,A Lucknow-based customer has filed an FIR agai...,a lucknowbas custom ha file an fir against chi...,a lucknowbased customer ha filed an fir agains...,lucknowbased customer filed fir chinese ecomme...
3,Onion prices surge up to ₹200 per kg in Bengaluru,business,Price of onion has shot up to ₹200 per kg in B...,price of onion ha shot up to 200 per kg in ben...,price of onion ha shot up to 200 per kg in ben...,price onion shot 200 per kg bengaluru due seve...
4,UP Cabinet approves Zurich Airport Int'l as de...,business,The Uttar Pradesh Cabinet on Monday approved Z...,the uttar pradesh cabinet on monday approv zur...,the uttar pradesh cabinet on monday approved z...,uttar pradesh cabinet monday approved zurich a...
...,...,...,...,...,...,...
94,"Salman, Katrina perform at B'desh Premier Leag...",entertainment,Actors Salman Khan and Katrina Kaif performed ...,actor salman khan and katrina kaif perform at ...,actor salman khan and katrina kaif performed a...,actors salman khan katrina kaif performed open...
95,It's deeply disappointing: Dia Mirza on Hydera...,entertainment,After the four men accused of gangraping and m...,after the four men accus of gangrap and murder...,after the four men accused of gangraping and m...,four men accused gangraping murdering 27yearol...
96,Sushmita Sen announces her return to films aft...,entertainment,Sushmita Sen on Monday took to Instagram to an...,sushmita sen on monday took to instagram to an...,sushmita sen on monday took to instagram to an...,sushmita sen monday took instagram announce re...
97,"I'm not a political person, don't believe in '...",entertainment,Singer Usha Uthup has said she's working as a ...,singer usha uthup ha said she work as a singer...,singer usha uthup ha said shes working a a sin...,singer usha uthup said shes working singer gen...


In [5]:
df_fresh.clean[0]

'us jury found elon musk defame british cave explorer vernon unsworth calling pedo guy twitter tesla ceo said faith humanity restored 48yearold billionaire argued intend call unsworth paedophile instead using pedo guy mean creepy old guy'

In [6]:
df = df_fresh[["category", "clean"]]
df

Unnamed: 0,category,clean
0,business,us jury found elon musk defame british cave ex...
1,business,british cave explorer vernon unsworth lost def...
2,business,lucknowbased customer filed fir chinese ecomme...
3,business,price onion shot 200 per kg bengaluru due seve...
4,business,uttar pradesh cabinet monday approved zurich a...
...,...,...
94,entertainment,actors salman khan katrina kaif performed open...
95,entertainment,four men accused gangraping murdering 27yearol...
96,entertainment,sushmita sen monday took instagram announce re...
97,entertainment,singer usha uthup said shes working singer gen...


#### Explore 

In [7]:
df.category.value_counts(normalize=True)

entertainment    0.252525
technology       0.252525
sports           0.252525
business         0.242424
Name: category, dtype: float64

In [8]:
pd.concat(
    [df.category.value_counts(), df.category.value_counts(normalize=True)], axis=1
).set_axis(["n", "percent"], axis=1)

Unnamed: 0,n,percent
entertainment,25,0.252525
technology,25,0.252525
sports,25,0.252525
business,24,0.242424


#### Define Features

In [9]:
df

Unnamed: 0,category,clean
0,business,us jury found elon musk defame british cave ex...
1,business,british cave explorer vernon unsworth lost def...
2,business,lucknowbased customer filed fir chinese ecomme...
3,business,price onion shot 200 per kg bengaluru due seve...
4,business,uttar pradesh cabinet monday approved zurich a...
...,...,...
94,entertainment,actors salman khan katrina kaif performed open...
95,entertainment,four men accused gangraping murdering 27yearol...
96,entertainment,sushmita sen monday took instagram announce re...
97,entertainment,singer usha uthup said shes working singer gen...


In [10]:
raw_count = pd.Series(" ".join(df.clean).split()).value_counts()

In [11]:
raw_count

said         82
added        36
india        18
also         17
one          15
             ..
4900          1
explained     1
reclaim       1
values        1
claus         1
Length: 2049, dtype: int64

In [12]:
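# Term-frequency measures for each word across the whole corpus
tf_df = (pd.DataFrame({'raw_count': raw_count})
         # share of all word occurrences in the corpus
         .assign(frequency=lambda d: d.raw_count / d.raw_count.sum())
         # frequency relative to the most common word
         .assign(augmented_frequency=lambda d: d.frequency / d.frequency.max()))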

tf_df.head()

Unnamed: 0,raw_count,frequency,augmented_frequency
said,82,0.022447,1.0
added,36,0.009855,0.439024
india,18,0.004927,0.219512
also,17,0.004654,0.207317
one,15,0.004106,0.182927
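
To make the three columns concrete, here is the same calculation on a tiny made-up string (the text and the `toy_` names are invented for illustration and are not part of the news data). `frequency` is each word's share of all word occurrences, and the column this notebook calls `augmented_frequency` is each frequency divided by the largest frequency, so the most common word always gets 1.0.

```python
import pandas as pd

# Toy text, invented for illustration only
toy_text = "cat sat cat mat cat sat"

# Count how often each word occurs
toy_counts = pd.Series(toy_text.split()).value_counts()

toy_tf = (
    pd.DataFrame({"raw_count": toy_counts})
    # share of all word occurrences
    .assign(frequency=lambda d: d.raw_count / d.raw_count.sum())
    # frequency relative to the most common word
    .assign(augmented_frequency=lambda d: d.frequency / d.frequency.max())
)

print(toy_tf)
```

Here "cat" appears 3 times out of 6 words, so its frequency is 3/6 and its augmented frequency is 1.0; "mat" appears once, giving 1/6 and (1/6)/(3/6) = 1/3.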


In [13]:
tfidf = TfidfVectorizer()
tfidfs = tfidf.fit_transform(df.clean)
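
One possible way to approach the two exercise questions is sketched below. It is a rough outline, not the lesson's method: the `texts` and `labels` lists are made-up stand-ins for `df.clean` and `df.category`, the three classifiers are just examples of algorithms one could try, and `CountVectorizer` (raw word counts) is assumed to be an acceptable reading of "term frequency data alone".

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Made-up stand-ins for df.clean and df.category (illustration only)
texts = [
    "shares rose company profit market",
    "bank cut interest rates market",
    "startup raised funding investors profit",
    "company sales fell market shares",
    "team won match final goal",
    "captain scored goal team match",
    "player injured match season team",
    "coach praised team win final",
]
labels = ["business"] * 4 + ["sports"] * 4

# Raw counts versus TF-IDF weights
vectorizers = {"term frequency": CountVectorizer, "tf-idf": TfidfVectorizer}

# Example classifiers; hyperparameters here are arbitrary choices
models = {
    "logistic regression": lambda: LogisticRegression(max_iter=1000),
    "naive bayes": lambda: MultinomialNB(),
    "decision tree": lambda: DecisionTreeClassifier(max_depth=5, random_state=123),
}

scores = {}
for vec_name, make_vectorizer in vectorizers.items():
    for model_name, make_model in models.items():
        # The pipeline refits the vectorizer inside each fold,
        # so the held-out fold never influences the vocabulary
        pipeline = make_pipeline(make_vectorizer(), make_model())
        fold_scores = cross_val_score(pipeline, texts, labels, cv=2)
        scores[(vec_name, model_name)] = fold_scores.mean()

for (vec_name, model_name), score in scores.items():
    print(f"{vec_name:>14} | {model_name:<20} | mean accuracy: {score:.2f}")
```

With the real data, `texts` and `labels` would be replaced by `df.clean` and `df.category`, and a larger `cv` value becomes practical. Scores on a corpus of about a hundred articles will be noisy, so small differences between the two vectorizers should be read with caution.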