## Importing Libraries

In [7]:
import pandas as pd
import numpy as np
# TF-IDF vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
# Linear support vector classifier
from sklearn.svm import LinearSVC

### What is TfidfVectorizer?
#### TfidfVectorizer is a scikit-learn class that converts a collection of text documents into a numerical matrix suitable for machine learning algorithms. Each entry is the Term Frequency-Inverse Document Frequency (TF-IDF) weight of a term in a document.
#### - It turns raw text into numerical features that machine learning algorithms can consume.
#### - It can be used to extract relevant features from text data, such as the most informative words or phrases.
#### - It weights a term higher when it is frequent within a document but rare across the corpus, which can help improve the performance of machine learning models.
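
To make the weighting concrete, here is a minimal sketch on a made-up three-document corpus (the documents and the `toy_` names are assumptions for illustration, not part of the dataset). It uses the vectorizer's default settings and shows that terms appearing in fewer documents receive a larger inverse document frequency.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical three-document corpus, for illustration only.
docs = [
    "the senate passed the bill",
    "the senate rejected the bill",
    "aliens endorsed the bill",
]

toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(docs)

# vocabulary_ maps each term to its column; idf_ holds one IDF value per column.
idf_by_term = {
    term: toy_vectorizer.idf_[col]
    for term, col in toy_vectorizer.vocabulary_.items()
}

print(toy_matrix.shape)  # one row per document, one column per distinct term
for term in ("the", "senate", "aliens"):
    print(term, round(idf_by_term[term], 3))
```

"the" occurs in every document and gets the lowest IDF, "senate" occurs in two and sits in the middle, and "aliens" occurs in one and gets the highest.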

## Importing Dataset

In [8]:
data = pd.read_csv('fake_or_real.csv')
data.head()

Unnamed: 0,id,title,text,label
0,8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE
1,10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE
2,3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL
3,10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE
4,875,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL


In [9]:
data['Fake'] = data['label'].apply(lambda x: 0 if x=="REAL" else 1)
data = data.drop('label',axis=1)
data.head()

Unnamed: 0,id,title,text,Fake
0,8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",1
1,10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,1
2,3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,0
3,10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",1
4,875,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,0


#### Created a new column 'Fake' holding 0 or 1: REAL becomes 0, and any other label (here, FAKE) becomes 1.
#### Dropped the original 'label' column.
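
One caveat of the lambda used here is that every label other than the exact string "REAL" becomes 1, including typos or missing values. A possible alternative, sketched below on a made-up frame (the `toy` data and the `label_to_int` name are assumptions), is an explicit dictionary with `Series.map`, which leaves unexpected labels as NaN so they can be inspected.

```python
import pandas as pd

# Hypothetical miniature frame standing in for the real dataset.
toy = pd.DataFrame({"label": ["FAKE", "REAL", "FAKE", "real"]})

# Explicit mapping: only the two expected labels are converted.
label_to_int = {"REAL": 0, "FAKE": 1}
toy["Fake"] = toy["label"].map(label_to_int)

# Labels missing from the mapping become NaN instead of silently turning into 1.
unexpected = toy.loc[toy["Fake"].isna(), "label"].tolist()
print(unexpected)
```

If the list of unexpected labels is empty, the column can be cast to integers with `astype(int)`.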

## Assigning X and y

In [10]:
X = data['text']
y = data['Fake']

## Splitting the Dataset into Train and Test Data
### (Test Data size = 20%)

In [18]:
X_train , X_test , y_train , y_test = train_test_split( X , y , test_size = 0.2 )
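
Because no `random_state` is passed, each run of this cell produces a different split, so the row indices and the accuracy shown in this notebook will not reproduce exactly. The sketch below, on made-up data (the `toy_X` and `toy_y` names and the seed 42 are assumptions), shows how fixing `random_state` makes the split repeatable and how `stratify` keeps the class balance in both parts.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical balanced toy data: 20 documents, alternating labels.
toy_X = pd.Series([f"document {i}" for i in range(20)])
toy_y = pd.Series([0, 1] * 10)

# The same seed gives the same split on every call.
split_a = train_test_split(toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y)
split_b = train_test_split(toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y)

X_tr, X_te, y_tr, y_te = split_a
print(len(X_tr), len(X_te))
print(split_a[1].tolist() == split_b[1].tolist())
print(y_te.value_counts().to_dict())
```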

In [19]:
X_train

3475    Prev post Page 1 of 4 Next \nNurses are among ...
3558    VIDEOS If Clinton goes down, Loretta Lynch wil...
688     A guide to the Paradoxroutine page: 1 Hah I'm ...
1382    Donald Trump, defying the pundits and polls to...
3717    0 9 0 0 The US is training Syrian opposition f...
                              ...                        
2934    Republican presidential candidate Donald Trump...
4723    Washington (CNN) Compromises on some of the cr...
5863    Soon, much of the empire’s Christian Armenian ...
6137    SANFORD, Fla.—The struggle for the White House...
2227    " 'Your armed forces on Monday carried out foc...
Name: text, Length: 5068, dtype: object

In [20]:
len(X_train)


5068

In [21]:
len(X_test)

1267

In [22]:
X_test

772     A gay man is selling beer during the NBA playo...
3271    Get short URL 0 17 0 0 Another tremor of magni...
2967    On the day ISIS-related terrorists spread out ...
2435    In a highly controversial decision, the Louisi...
699     Bias bashers More Beer, Less Vodka as Russians...
                              ...                        
793     Illegal immigrants—along with other noncitizen...
5935    Another Trump Surrogate Admits Trump Won’t Bui...
1578    Based on the possibility that you will become ...
3635      \nEd wasn’t excited about his job. He worked...
1516    Share This \nDespite being dead for over 7 yea...
Name: text, Length: 1267, dtype: object

## Vectorizing the Dataset

In [29]:
vectorizer = TfidfVectorizer(stop_words='english',max_df=0.7)
X_train_vectorized = vectorizer.fit_transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)

#### stop_words='english': Removes common English words (stop words) from the text before vectorization.
#### max_df=0.7: Sets the maximum document frequency for a term to 0.7, so terms that appear in more than 70% of the training documents are ignored.
#### X_train_vectorized = vectorizer.fit_transform(X_train): Fits the TfidfVectorizer to the training data (X_train) and transforms it into a TF-IDF matrix. The fit_transform method learns the vocabulary and IDF weights from the training data, then returns a sparse matrix in which each row represents a document and each column represents a term.
#### X_test_vectorized = vectorizer.transform(X_test): Transforms the testing data (X_test) using the vocabulary and IDF weights learned from the training data. The result has the same columns as the training matrix (one row per test document), and words that were not seen during fitting are ignored.
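
The behaviour of `max_df` and of `transform` on unseen words can be checked on a small made-up corpus. This is only a sketch: the documents and the `demo_vectorizer` name are assumptions, while the vectorizer settings are the ones used in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical documents, for illustration only.
train_docs = [
    "election results announced today",
    "election fraud claims spread online",
    "election officials certify results",
]
test_docs = ["election results disputed by unicorns"]

demo_vectorizer = TfidfVectorizer(stop_words="english", max_df=0.7)
train_matrix = demo_vectorizer.fit_transform(train_docs)
test_matrix = demo_vectorizer.transform(test_docs)

vocab = demo_vectorizer.vocabulary_
# "election" is in all 3 training documents (more than 70%), so max_df drops it.
print("election" in vocab)
# "unicorns" never appeared during fit, so it has no column.
print("unicorns" in vocab)
# "results" is in 2 of 3 training documents, so it is kept.
print("results" in vocab)
# The train and test matrices share the same columns.
print(train_matrix.shape[1] == test_matrix.shape[1])
```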

In [30]:
X_train_vectorized

<5068x61836 sparse matrix of type '<class 'numpy.float64'>'
	with 1324959 stored elements in Compressed Sparse Row format>

In [31]:
clf = LinearSVC()
clf.fit(X_train_vectorized , y_train)

#### LinearSVC() implements a linear support vector machine (SVM) for classification tasks. It is a common choice for text classification because it trains efficiently and tends to perform well on high-dimensional sparse data (like TF-IDF features).
#### clf.fit(X_train_vectorized, y_train): It trains the LinearSVC classifier on the training data.
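
As an optional variation, not what this notebook does, the vectorizer and the classifier can be chained with scikit-learn's `make_pipeline` so that raw text can be passed directly to `fit` and `predict`. The toy texts, labels, and the `toy_model` name below are assumptions made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy texts and labels (1 = fake, 0 = real), for illustration only.
toy_texts = [
    "shocking secret cure doctors hate",
    "senate votes on budget bill",
    "celebrity clone spotted on the moon",
    "central bank holds interest rates",
]
toy_labels = [1, 0, 1, 0]

# The pipeline calls fit_transform on the vectorizer during fit
# and transform during predict, so the two steps cannot get out of sync.
toy_model = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    LinearSVC(),
)
toy_model.fit(toy_texts, toy_labels)

toy_predictions = toy_model.predict(["senate holds budget vote", "secret moon cure"])
print(toy_predictions)
```

With only four training texts the predictions are not meaningful; the point is the shape of the workflow.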

## Checking the Accuracy of the Model

In [32]:
clf.score(X_test_vectorized , y_test)

0.9494869771112865

In [33]:
len(y_test)

1267

In [36]:
len(y_test) * 0.9494

1202.8898

### As you can see, the model reaches about 95% accuracy on the test set. Multiplying by the truncated figure 0.9494 gives 1202.89, which rounds to 1203: the model correctly classified 1203 of the 1267 test articles (1203 / 1267 ≈ 0.9495).
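
Multiplying the test-set size by a rounded accuracy only approximates the number of correct predictions. A more direct route, sketched here with made-up labels standing in for `y_test` and `clf.predict(X_test_vectorized)`, is `accuracy_score` with `normalize=False`, which returns the count of correct predictions.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels and predictions, standing in for y_test and the model's output.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# normalize=False returns the number of correct predictions instead of the fraction.
n_correct = accuracy_score(y_true, y_pred, normalize=False)
accuracy = accuracy_score(y_true, y_pred)

print(n_correct, len(y_true), accuracy)
```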