## Importing Packages and NLP-Related Data

In [1]:
import numpy as np
import pandas as pd
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

1. **Regular Expression (re)**:
   - Python's built-in module for regular expressions. It provides a powerful way to search and manipulate strings based on patterns.

2. **NLTK**:
   - Stands for Natural Language Toolkit. It is a library in Python used for processing and analyzing human language data.

3. **PorterStemmer**:
   - A stemming algorithm that reduces words to a common stem by stripping suffixes. For example, "learning," "learned," and "learns" all reduce to "learn." The Porter algorithm does not remove prefixes, and the resulting stem is not always a dictionary word.

4. **TfidfVectorizer**:
   - Used to transform text data into feature vectors. It converts a collection of raw documents into a matrix of TF-IDF (Term Frequency-Inverse Document Frequency) features.

5. **Stopwords**:
   - Words that do not add significant meaning to a sentence and are often filtered out during text processing. Examples include "the," "on," "is," etc.
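
To make the TF-IDF idea above concrete, here is a minimal hand-rolled sketch using only the standard library. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation, which is what scikit-learn documents as the default for `TfidfVectorizer`. The toy corpus and the `tfidf` helper are hypothetical, and tokenisation here is a plain whitespace split rather than the vectorizer's own tokenizer.

```python
import math
from collections import Counter

# Hypothetical toy corpus (not from the dataset), split on whitespace.
docs = ["fake news spread", "real news report", "fake claim"]

def tfidf(docs):
    tokenized = [d.split() for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Term frequency times smoothed inverse document frequency.
        weights = {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in tf}
        # L2-normalise so that every document vector has unit length.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors

vectors = tfidf(docs)
print(vectors[0])
```

In the first document, "spread" appears in only one document while "news" and "fake" each appear in two, so "spread" receives the largest weight.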


In [2]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Lenovo\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## Data Preprocessing

In [3]:
df = pd.read_csv(r'C:\Users\Lenovo\Downloads\Fake-News-Prediction-main\data\train.csv', header = 0)

In [4]:
df.shape

(20800, 5)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20800 entries, 0 to 20799
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      20800 non-null  int64 
 1   title   20242 non-null  object
 2   author  18843 non-null  object
 3   text    20761 non-null  object
 4   label   20800 non-null  int64 
dtypes: int64(2), object(3)
memory usage: 812.6+ KB


* The output above shows missing values in title (558), author (1,957), and text (39).

### Missing Value Imputation

In [6]:
df = df.fillna(' ')

In [7]:
df['describe'] = df['title'] + ' ' + df['author']
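
The following sketch illustrates why the missing values are filled before the columns are concatenated: in pandas, adding a string to a missing value yields a missing value, so any row without a title or author would otherwise end up with an empty `describe`. The two-row `toy` frame is hypothetical and only stands in for the real data.

```python
import pandas as pd

# Hypothetical two-row frame; the second row has no author.
toy = pd.DataFrame({
    "title": ["Budget vote delayed", "Storm hits coast"],
    "author": ["Jane Doe", None],
})

# Without filling, the row with a missing author becomes missing as a whole.
without_fill = toy["title"] + " " + toy["author"]

# With fillna(' '), the concatenation keeps the title of that row.
filled = toy.fillna(" ")
with_fill = filled["title"] + " " + filled["author"]

print(without_fill.isna().tolist())
print(with_fill.tolist())
```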

In [8]:
x = df.loc[:, df.columns != 'label']

In [9]:
y = df['label']

## Stemming

In [10]:
port_stem = PorterStemmer()

In [11]:
english_stopwords = set(stopwords.words('english'))  # build the stopword set once instead of once per word

def stemming(argument):
    stemmed_argument = re.sub('[^a-zA-Z]', ' ', argument)
    stemmed_argument = stemmed_argument.lower()
    stemmed_argument = stemmed_argument.split()
    stemmed_argument = [port_stem.stem(word) for word in stemmed_argument if word not in english_stopwords]
    stemmed_argument = ' '.join(stemmed_argument)
    return stemmed_argument

Explanation of the function:

1. **Line 1**:
   - It keeps only the letters a-z and A-Z and replaces every other character (digits, punctuation, symbols) with a space.

2. **Line 2**:
   - It converts all letters to lowercase.

3. **Line 3**:
   - It splits the text into a list of words.

4. **Line 4**:
   - It drops stopwords and stems each remaining word.

5. **Line 5**:
   - It joins all the words into a single string using spaces.

6. **Line 6**:
   - It returns the processed string.
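
The steps above can be traced on a single sentence. This is a sketch under two assumptions: the sample `text` is invented, and `STOPWORDS` is a small hypothetical stand-in for `stopwords.words('english')` so that the example runs without downloading NLTK data.

```python
import re
from nltk.stem.porter import PorterStemmer

# Hypothetical mini stopword list; the notebook uses stopwords.words('english').
STOPWORDS = {"the", "is", "on", "and"}
stemmer = PorterStemmer()

text = "Learning, learned and LEARNS on the 2 tests!"

letters_only = re.sub('[^a-zA-Z]', ' ', text)   # Line 1: non-letters become spaces
lowered = letters_only.lower()                  # Line 2: lowercase everything
tokens = lowered.split()                        # Line 3: split into words
stems = [stemmer.stem(w) for w in tokens if w not in STOPWORDS]  # Line 4: filter, then stem
result = ' '.join(stems)                        # Line 5: join back into one string

print(tokens)
print(result)
```

Note that the digit "2" disappears entirely, because the regular expression in the first step replaces it with a space before the split.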


In [12]:
df['describe'] = df['describe'].apply(stemming)

## Variable Extraction and Data Vectorization

In [13]:
x = df['describe'].values
y = df['label'].values

* Data Vectorization

In [14]:
vectorizer = TfidfVectorizer()

In [15]:
x = vectorizer.fit_transform(x)  # learn the vocabulary and IDF weights, then vectorize in one step

### Train-Test Split

In [16]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 0)
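
In the cells above, the vectorizer is fitted on all rows before the split, so its vocabulary and IDF weights also reflect the test rows. One common alternative, shown here only as a sketch and not as what this notebook does, is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn `Pipeline`, so that fitting only ever sees the training portion. The `texts` and `labels` lists are hypothetical stand-ins for `df['describe']` and `df['label']`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical, already-cleaned toy data (not from the dataset).
texts = [
    "senat vote budget", "court rule case", "mayor open bridge", "team win final",
    "alien land city", "miracl cure found", "secret plot reveal", "celebr clone exist",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Split the raw text first, keeping the class balance in both parts.
texts_train, texts_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

# The pipeline fits the vectorizer on the training texts only.
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())
pipeline.fit(texts_train, y_train)

vocab = pipeline.named_steps["tfidfvectorizer"].vocabulary_
score = pipeline.score(texts_test, y_test)
print(len(vocab), score)
```

With a dataset as small as this toy one the score is meaningless; the point is only that the learned vocabulary contains training words and nothing else.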

## Training and Evaluating the Model

In [17]:
from sklearn.linear_model import LogisticRegression
model_logi = LogisticRegression()

In [18]:
model_logi.fit(x_train, y_train)

In [19]:
train_pred_logi = model_logi.predict(x_train)
from sklearn.metrics import accuracy_score
accuracy_score(y_train, train_pred_logi)

0.9876201923076923

In [20]:
test_pred_logi = model_logi.predict(x_test)
accuracy_score(y_test, test_pred_logi)

0.9783653846153846

In [21]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test,test_pred_logi)

array([[1975,   71],
       [  19, 2095]], dtype=int64)

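* The model performs well, with a test-set accuracy of about 0.978 (4,070 of 4,160 articles classified correctly). The confusion matrix shows 71 articles with label 0 predicted as 1 and 19 articles with label 1 predicted as 0. The training accuracy of about 0.988 is close to the test accuracy, so there is little sign of overfitting.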
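
The counts in the confusion matrix above can be turned into accuracy, precision, and recall with plain arithmetic. The sketch below copies the four counts from the output and assumes scikit-learn's layout, in which rows are true labels and columns are predicted labels; precision and recall are computed for label 1.

```python
# Counts copied from the confusion matrix output (rows: true label, columns: predicted label).
tn, fp = 1975, 71    # true label 0: predicted 0, predicted 1
fn, tp = 19, 2095    # true label 1: predicted 0, predicted 1

total = tn + fp + fn + tp
accuracy = (tn + tp) / total   # share of all test articles classified correctly
precision = tp / (tp + fp)     # of the articles predicted as 1, how many are truly 1
recall = tp / (tp + fn)        # of the articles that are truly 1, how many were found

print(f"total={total} accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
```

The accuracy computed this way matches the value returned by `accuracy_score` on the test set.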