## **Building a fake news classifier**

1. Here, we build bag-of-words vectors from a dataset of news articles.
2. In short, each row of the dataset holds an article's title and text, together with a label marking the article as FAKE or REAL.
3. We wish to create bag-of-words vectors for these articles to see if we can predict the label from the words used in the article text.
4. To do so, we use scikit-learn as follows:
   * Load the data.
   * Define the label, y.
   * Split the data into training and test sets.
   * Create the CountVectorizer object, which turns the text into **bags of words**, similar to a Gensim corpus. NOTE: as a pre-processing step, **ensure English stop words are removed when the bag of words is built**.
   * Each token then acts as a feature for the classifier.
   * Use the .fit_transform() method on the training data to create the bag-of-words vectors.
   * fit_transform() builds the bag-of-words dictionary (the vocabulary) from the training data and produces a count vector for each training document.
   * Apply the same fitted vectorizer to the test data with .transform(), so that the test vectors use the vocabulary learned from the training data.
     

### **CountVectorizer for text classification**

In [6]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

df = pd.read_csv("/kaggle/input/fake-or-real-news/fake_or_real_news.csv")  # load data
print(df.head(5))
print(df.columns)

   Unnamed: 0                                              title  \
0        8476                       You Can Smell Hillary’s Fear   
1       10294  Watch The Exact Moment Paul Ryan Committed Pol...   
2        3608        Kerry to go to Paris in gesture of sympathy   
3       10142  Bernie supporters on Twitter erupt in anger ag...   
4         875   The Battle of New York: Why This Primary Matters   

                                                text label  
0  Daniel Greenfield, a Shillman Journalism Fello...  FAKE  
1  Google Pinterest Digg Linkedin Reddit Stumbleu...  FAKE  
2  U.S. Secretary of State John F. Kerry said Mon...  REAL  
3  — Kaydee King (@KaydeeKing) November 9, 2016 T...  FAKE  
4  It's primary day in New York and front-runners...  REAL  
Index(['Unnamed: 0', 'title', 'text', 'label'], dtype='object')


> * Import CountVectorizer from sklearn.feature_extraction.text and train_test_split from sklearn.model_selection.
> * Create a Series y to use for the labels by assigning the .label attribute of df to y.
> * Using df["text"] (features) and y (labels), create training and test sets using train_test_split(). Use a test_size of 0.33 and a random_state of 53.
> * Create a CountVectorizer object called count_vectorizer. Specify the keyword argument stop_words="english" so that stop words are removed.
> * Fit and transform the training data X_train using the .fit_transform() method of your CountVectorizer object. Do the same with the test data X_test, but use the .transform() method instead.
> * Print the first 10 features of the count_vectorizer using its .get_feature_names_out() method.




In [11]:
# Create a series to store the labels: y
y = df.label

# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(df["text"], y, test_size=0.33, random_state=53)

# Initialize a CountVectorizer object: count_vectorizer
count_vectorizer = CountVectorizer(stop_words="english")

# Fit the vectorizer on the training text and transform it: count_train
count_train = count_vectorizer.fit_transform(X_train)

# Transform the test text with the vocabulary learned from the training data: count_test
count_test = count_vectorizer.transform(X_test)

# Print the first 10 features of the count_vectorizer
print(count_vectorizer.get_feature_names_out()[:10])

['00' '000' '0000' '00000031' '000035' '00006' '0001' '0001pt' '000ft'
 '000km']
