# TfidfVectorizer Explanation
`TfidfVectorizer` converts a collection of raw documents to a matrix of TF-IDF features.

In TF-IDF, TF stands for term frequency and IDF stands for inverse document frequency.

In [20]:
from sklearn.feature_extraction.text import TfidfVectorizer
text = ['Hello Kushal Bhavsar here, I love machine learning','Welcome to the Machine learning hub' ]

In [21]:
vect = TfidfVectorizer()

In [22]:
vect.fit(text)

TfidfVectorizer()

In [23]:
## TF counts how often a word occurs in each document; IDF down-weights words that occur in many documents.
print(vect.idf_)

[1.40546511 1.40546511 1.40546511 1.40546511 1.40546511 1.
 1.40546511 1.         1.40546511 1.40546511 1.40546511]


In [24]:
print(vect.vocabulary_)

{'hello': 1, 'kushal': 4, 'bhavsar': 0, 'here': 2, 'love': 6, 'machine': 7, 'learning': 5, 'welcome': 10, 'to': 9, 'the': 8, 'hub': 3}


### A word that is present in every document gets a low IDF value (here 'machine' and 'learning', with IDF 1.0), so the rarer, more distinctive words stand out with the highest IDF values.
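The IDF values printed above can be checked by hand. This is a minimal sketch, assuming scikit-learn's default smoothed formula `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, where `n` is the number of documents and `df(t)` is the number of documents containing the term. The helper name `smooth_idf` is made up for this illustration.

```python
import math

def smooth_idf(doc_freq, n_docs):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1 (hypothetical helper)."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 2  # the two sentences in `text`

# A word found in only one sentence, e.g. 'hello'
idf_rare = smooth_idf(1, n_docs)
# A word found in both sentences, e.g. 'machine'
idf_common = smooth_idf(2, n_docs)

print(round(idf_rare, 8))  # 1.40546511
print(idf_common)          # 1.0
```

Both numbers match the two distinct values in `vect.idf_`.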

In [25]:
example = text[0]
example

'Hello Kushal Bhavsar here, I love machine learning'

In [26]:
example = vect.transform([example])
print(example.toarray())

[[0.4078241  0.4078241  0.4078241  0.         0.4078241  0.29017021
  0.4078241  0.29017021 0.         0.         0.        ]]


### Here, a 0 at a given index means that the vocabulary word with that index does not occur in the given sentence.
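The non-zero entries can be reproduced by hand as well. The sketch below assumes the vectorizer defaults: each word's count (1 here) is multiplied by its smoothed IDF, and the row is then scaled to unit L2 length. The document frequencies are read off the two example sentences; the variable names are illustrative only.

```python
import math

n_docs = 2

# Document frequency of each word in the first sentence. "I" is missing
# because the default tokenizer ignores single-character tokens.
doc_freq = {
    'bhavsar': 1, 'hello': 1, 'here': 1, 'kushal': 1, 'love': 1,
    'learning': 2, 'machine': 2,
}

# Every word occurs once in the sentence, so tf = 1 and tf * idf = idf
raw = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

# Scale the row so that its Euclidean (L2) length is 1
norm = math.sqrt(sum(v * v for v in raw.values()))
weights = {t: v / norm for t, v in raw.items()}

for term in sorted(weights):
    print(term, round(weights[term], 4))
```

The words unique to the sentence come out at roughly 0.4078 and the two shared words at roughly 0.2902, in line with the array printed above.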

## PassiveAggressiveClassifier

### Passive: if correct classification, keep the model; Aggressive: if incorrect classification, update to adjust to this misclassified example.

Passive-Aggressive algorithms are generally used for large-scale learning. They are among the few online-learning algorithms. In online machine learning, the input data arrives in sequential order and the model is updated step by step, as opposed to batch learning, where the entire training dataset is used at once. This is very useful when there is a huge amount of data and training on the entire dataset at once is computationally infeasible because of its sheer size. Put simply, an online-learning algorithm gets a training example, updates the classifier, and then throws away the example.
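To make the passive and aggressive steps concrete, here is a minimal sketch of the textbook update rule for a linear binary classifier. It assumes labels encoded as +1 and -1 and uses the basic hard-margin variant; it is not scikit-learn's implementation, which to my understanding also caps the step size through its `C` parameter. The function `pa_update` and the toy stream are invented for illustration.

```python
def pa_update(w, x, y):
    """One passive-aggressive step on example (x, y), with y in {+1, -1}."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    loss = max(0.0, 1.0 - margin)  # hinge loss
    sq_norm = sum(xi * xi for xi in x)
    if loss == 0.0 or sq_norm == 0.0:
        return w  # passive: the example is already classified with enough margin
    tau = loss / sq_norm  # aggressive: smallest step that fixes this example
    return [wi + tau * y * xi for wi, xi in zip(w, x)]

# Examples arrive one at a time and are discarded after the update
toy_stream = [([1.0, 0.0], 1), ([0.0, 1.0], -1), ([1.0, 0.0], 1)]
w = [0.0, 0.0]
for x, y in toy_stream:
    w = pa_update(w, x, y)

print(w)  # [1.0, -1.0]
```

The first two examples trigger aggressive updates; the third is already classified with margin 1, so the weights stay as they are.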

## Let's start the work

In [27]:
import pandas as pd

In [28]:
dataframe = pd.read_csv('IFND.csv')
dataframe.head()

| Unnamed: 0 | id | Statement | Category | Label |
|---|---|---|---|---|
| 0 | 2 | WHO praises India's Aarogya Setu app, says it ... | COVID-19 | True |
| 1 | 3 | In Delhi, Deputy US Secretary of State Stephen... | VIOLENCE | True |
| 2 | 4 | LAC tensions: China's strategy behind delibera... | TERROR | True |
| 3 | 5 | India has signed 250 documents on Space cooper... | COVID-19 | True |
| 4 | 6 | Tamil Nadu chief minister's mother passes away... | ELECTION | True |


In [29]:
x = dataframe['Statement']
y = dataframe['Label']

In [30]:
x

0        WHO praises India's Aarogya Setu app, says it ...
1        In Delhi, Deputy US Secretary of State Stephen...
2        LAC tensions: China's strategy behind delibera...
3        India has signed 250 documents on Space cooper...
4        Tamil Nadu chief minister's mother passes away...
                               ...                        
56709    Fact Check: This is not Bruce Lee playing ping...
56710    Fact Check: Did Japan construct this bridge in...
56711    Fact Check: Viral video of Mexico earthquake i...
56712    Fact Check: Ballet performance by Chinese coup...
56713    Fact Check: Is this little boy crossing into J...
Name: Statement, Length: 56714, dtype: object

In [31]:
y

0        TRUE
1        TRUE
2        TRUE
3        TRUE
4        TRUE
         ... 
56709    Fake
56710    Fake
56711    Fake
56712    Fake
56713    Fake
Name: Label, Length: 56714, dtype: object

In [32]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

In [33]:
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=0)
y_train

53877    Fake
8242     TRUE
40336    Fake
42849    Fake
54574    Fake
         ... 
45891    Fake
52416    Fake
42613    Fake
43567    Fake
2732     TRUE
Name: Label, Length: 45371, dtype: object

In [35]:
tfvect = TfidfVectorizer(stop_words='english',max_df=0.7)
tfid_x_train = tfvect.fit_transform(x_train)
tfid_x_test = tfvect.transform(x_test)

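* max_df = 0.7, as used here, means "ignore terms that appear in more than 70% of the documents".
* max_df = 0.50 means "ignore terms that appear in more than 50% of the documents".
* max_df = 25 (an integer) means "ignore terms that appear in more than 25 documents".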
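The cutoff rule can be sketched in plain Python. This is only an illustration of the rule described in the bullets, not the vectorizer's internals: `filter_by_max_df` is a hypothetical helper, the toy corpus is made up, and plain whitespace splitting stands in for the real tokenizer.

```python
from collections import Counter

def filter_by_max_df(docs, max_df):
    """Return the sorted vocabulary that survives a max_df-style cutoff."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        # Count each term once per document
        doc_freq.update(set(doc.lower().split()))
    # A float is read as a fraction of the documents, an int as an absolute count
    limit = max_df * n_docs if isinstance(max_df, float) else max_df
    return sorted(term for term, df in doc_freq.items() if df <= limit)

toy_docs = ["the cat sat", "the dog ran", "the bird flew", "a cat ran"]

print(filter_by_max_df(toy_docs, 0.7))
# ['a', 'bird', 'cat', 'dog', 'flew', 'ran', 'sat']
print(filter_by_max_df(toy_docs, 1))
# ['a', 'bird', 'dog', 'flew', 'sat']
```

With the fractional cutoff only "the" is dropped, since it appears in 3 of 4 documents (75%). With the integer cutoff every term that appears in more than one document is dropped.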

In [36]:
classifier = PassiveAggressiveClassifier(max_iter=50)
classifier.fit(tfid_x_train,y_train)

PassiveAggressiveClassifier(max_iter=50)

In [37]:
y_pred = classifier.predict(tfid_x_test)
score = accuracy_score(y_test,y_pred)
print(f'Accuracy: {round(score*100,2)}%')

Accuracy: 93.4%
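A single accuracy figure does not show which class the mistakes fall in. `confusion_matrix` is imported above but not used, so here is a small sketch of what such a matrix contains, built by hand on made-up labels (not the IFND predictions). Rows are true labels and columns are predicted labels.

```python
from collections import Counter

# Hypothetical labels for illustration only
toy_true = ['TRUE', 'Fake', 'Fake', 'TRUE', 'Fake']
toy_pred = ['TRUE', 'Fake', 'TRUE', 'TRUE', 'Fake']

labels = ['Fake', 'TRUE']
pair_counts = Counter(zip(toy_true, toy_pred))

# matrix[i][j] = number of examples with true label i predicted as label j
matrix = [[pair_counts[(t, p)] for p in labels] for t in labels]
accuracy = sum(pair_counts[(l, l)] for l in labels) / len(toy_true)

print(matrix)    # [[2, 1], [0, 2]]
print(accuracy)  # 0.8
```

In this toy case the only error is a 'Fake' item predicted as 'TRUE', which the accuracy alone would not reveal.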


In [38]:
def fake_news_det(news):
    input_data = [news]
    vectorized_input_data = tfvect.transform(input_data)
    prediction = classifier.predict(vectorized_input_data)
    print(prediction)

In [39]:
fake_news_det('U.S. Secretary of State John F. Kerry said Monday that he will stop in Paris later this week, amid criticism that no top American officials attended Sunday’s unity march against terrorism.')

['TRUE']


In [40]:
fake_news_det("""Go to Article 
President Barack Obama has been campaigning hard for the woman who is supposedly going to extend his legacy four more years. The only problem with stumping for Hillary Clinton, however, is she’s not exactly a candidate easy to get too enthused about.  """)

['Fake']


In [41]:
import pickle
# Save the trained classifier; the with-block closes the file afterwards
with open('model.pkl', 'wb') as f:
    pickle.dump(classifier, f)

In [42]:
# load the model from disk
with open('model.pkl', 'rb') as f:
    loaded_model = pickle.load(f)

In [43]:
def fake_news_det1(news):
    input_data = [news]
    vectorized_input_data = tfvect.transform(input_data)
    prediction = loaded_model.predict(vectorized_input_data)
    print(prediction)

In [44]:
fake_news_det1("""Go to Article 
President Barack Obama has been campaigning hard for the woman who is supposedly going to extend his legacy four more years. The only problem with stumping for Hillary Clinton, however, is she’s not exactly a candidate easy to get too enthused about.  """)

['Fake']


In [45]:
fake_news_det1("""U.S. Secretary of State John F. Kerry said Monday that he will stop in Paris later this week, amid criticism that no top American officials attended Sunday’s unity march against terrorism.""")

['TRUE']


In [46]:
fake_news_det('''U.S. Secretary of State John F. Kerry said Monday that he will stop in Paris later this week, amid criticism that no top American officials attended Sunday’s unity march against terrorism.''')

['TRUE']
