# TF-IDF
The Bag of Words method counts the number of times each word appears in a document. The first part of TF-IDF is term frequency (TF), which measures the number of times a word appears in a document. It is the same as Bag of Words.
The second part is inverse document frequency (IDF), which measures the significance of a particular word. The more documents the word appears in, the less significant that word is.

Steps to calculate TF-IDF are:

1. Define input
2. Calculate Term Frequency (TF)
3. Calculate Inverse Document Frequency (IDF)
4. Calculate TF-IDF
5. Normalize the results



## Define input

In [2]:
corpus = [
'I love dogs',
'I hate dogs and knitting',
'Knitting is my hobby and my passion']

corpus

['I love dogs',
 'I hate dogs and knitting',
 'Knitting is my hobby and my passion']

## Calculate Term Frequency (TF)

In [3]:
# The method is the same as in the Bag of Words notebook
from sklearn.feature_extraction.text import CountVectorizer

import pandas as pd
import numpy as np

count_vectorizer=CountVectorizer()

word_count=count_vectorizer.fit_transform(corpus)

# Create a dataframe and store the number of times each word appears in each document
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
df = pd.DataFrame(word_count.toarray(), columns=count_vectorizer.get_feature_names_out())
df


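|   | and | dogs | hate | hobby | is | knitting | love | my | passion |
|---|-----|------|------|-------|----|----------|------|----|---------|
| 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 2 | 1 |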


## Calculate Inverse Document Frequency (IDF)

The formula is <B>IDF = log(N/df(t))</B>, where N is the number of documents and df(t) is the document frequency of t, that is, the number of documents that contain the term t. <br>

Here we use the formula <B>IDF = 1 + log(N/df(t))</B> instead. We add 1 because terms with an IDF of 0, that is, terms that occur in all documents, should not be ignored completely.

For example,

Number of documents = 2<Br>
Number of documents that contain the term t = 2

Using the formula IDF = log(N/df(t)), we get an IDF score of 0, because log(2/2) = log(1) = 0. By adding 1, the IDF becomes 1, so the word still counts in the TF-IDF calculation.
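The effect of the added 1 can be checked with a few lines of plain Python. This is a minimal sketch: the helper names `idf_plain` and `idf_plus_one` and the term count of 3 are made up for illustration and are not part of any library.

```python
import math


def idf_plain(n_docs, doc_freq):
    """IDF = log(N / df(t)), using the natural logarithm."""
    return math.log(n_docs / doc_freq)


def idf_plus_one(n_docs, doc_freq):
    """IDF = 1 + log(N / df(t)), the variant used in this notebook."""
    return 1 + math.log(n_docs / doc_freq)


# A term that occurs in both documents of a two-document corpus
n_docs, doc_freq = 2, 2
term_count = 3  # hypothetical number of occurrences in one document

# With the plain formula the term's TF-IDF score collapses to zero
print(term_count * idf_plain(n_docs, doc_freq))

# With the +1 variant the score falls back to the raw term count
print(term_count * idf_plus_one(n_docs, doc_freq))
```

With the +1 variant, a term that appears everywhere is weighted exactly like a Bag of Words count instead of disappearing.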



### (Example) Calculate IDF for a single word

In [5]:
a = np.log(100)  # With no base specified, np.log uses base e, i.e. ln(100)

In [6]:
a

4.605170185988092

In [7]:
num_of_docs = 3
docs_containing_the_word_and = 2

# Calculate IDF
idf = np.log(num_of_docs / docs_containing_the_word_and)
idf

0.4054651081081644

In [8]:
idf = np.log(num_of_docs / docs_containing_the_word_and) + 1
idf

1.4054651081081644

## Calculate TF-IDF
TF-IDF is calculated by multiplying TF with IDF.

TF-IDF = TF * IDF
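As a sanity check, the product can be computed directly with NumPy and compared with scikit-learn. This is a sketch under the assumption that `smooth_idf=False` and `norm=None` are set, so that scikit-learn applies exactly IDF = 1 + log(N/df(t)) with no normalization; the variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    'I love dogs',
    'I hate dogs and knitting',
    'Knitting is my hobby and my passion',
]

# TF: raw term counts, one row per document and one column per term
tf = CountVectorizer().fit_transform(corpus).toarray()

# IDF = 1 + ln(N / df(t)); df(t) is the number of rows with a non-zero count
n_docs = tf.shape[0]
idf = 1 + np.log(n_docs / (tf > 0).sum(axis=0))

# TF-IDF = TF * IDF; broadcasting scales each column by that term's IDF
manual_tfidf = tf * idf

# Reference values from scikit-learn, without smoothing or normalization
sk_tfidf = TfidfVectorizer(smooth_idf=False, norm=None).fit_transform(corpus).toarray()

print(np.round(manual_tfidf, 2))
print(np.allclose(manual_tfidf, sk_tfidf))
```

Unlike the cell that follows, this sketch keeps stop words such as "and", "is" and "my", so its matrix has nine columns rather than six.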

# TF-IDF (without normalization)

In [19]:
# Import TfidfVectorizer from sklearn
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(smooth_idf=False, norm=None, stop_words='english')

# Fit the vectorizer and transform the corpus in one step
tfidf_scores = tfidf_vectorizer.fit_transform(corpus)

# Show the TF-IDF scores, rounded to two decimals
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
df = pd.DataFrame(np.round(tfidf_scores.toarray(), 2), columns=tfidf_vectorizer.get_feature_names_out())
df

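|   | dogs | hate | hobby | knitting | love | passion |
|---|------|------|-------|----------|------|---------|
| 0 | 1.41 | 0.0 | 0.0 | 0.0 | 2.1 | 0.0 |
| 1 | 1.41 | 2.1 | 0.0 | 1.41 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 2.1 | 1.41 | 0.0 | 2.1 |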


## Normalize the results
Normalize the results by setting norm='l2'.

If norm='l2' (the default), each output row has unit Euclidean norm, i.e. the sum of squares of the vector elements is 1.


If norm='l1', the sum of absolute values of the vector elements is 1.
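Both normalizations are simple to reproduce by hand. The sketch below applies them to the two non-zero scores of the first document, "I love dogs" ("dogs" and "love"), using only NumPy; the variable names are illustrative.

```python
import numpy as np

# Unnormalized TF-IDF scores for 'dogs' (in 2 of 3 documents)
# and 'love' (in 1 of 3 documents), each with a term frequency of 1
row = np.array([1 + np.log(3 / 2), 1 + np.log(3 / 1)])

# L2: divide by the Euclidean length, so the squares sum to 1
l2_row = row / np.sqrt(np.sum(row ** 2))

# L1: divide by the sum of absolute values, so the absolute values sum to 1
l1_row = row / np.sum(np.abs(row))

print(np.round(l2_row, 2), np.sum(l2_row ** 2))
print(np.round(l1_row, 2), np.sum(np.abs(l1_row)))
```

Either way the ratio between the two scores is unchanged; only the overall scale of the row is fixed, which makes documents of different lengths comparable.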


# TF-IDF (with normalization)

In [20]:
tfidf_vectorizer = TfidfVectorizer(smooth_idf=False, use_idf=True, lowercase=True, norm="l2")
tfidf_scores = tfidf_vectorizer.fit_transform(corpus)
df = pd.DataFrame(np.round(tfidf_scores.toarray(), 2), columns=tfidf_vectorizer.get_feature_names_out())
df

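|   | and | dogs | hate | hobby | is | knitting | love | my | passion |
|---|-----|------|------|-------|----|----------|------|----|---------|
| 0 | 0.0 | 0.56 | 0.0 | 0.0 | 0.0 | 0.0 | 0.83 | 0.0 | 0.0 |
| 1 | 0.44 | 0.44 | 0.65 | 0.0 | 0.0 | 0.44 | 0.0 | 0.0 | 0.0 |
| 2 | 0.24 | 0.0 | 0.0 | 0.36 | 0.36 | 0.24 | 0.0 | 0.71 | 0.36 |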


# XGBoost Model

In this notebook, we will learn to use the TF-IDF model to predict the sentiment of news headlines. The steps involved are:
1. Read data
2. Determine target variables
3. Create predictor variables
4. Split data into train and test 
5. Apply TF-IDF on train and test dataset
6. Run XGBoost on the train dataset
7. Predict sentiment scores on the test dataset
8. Analyse the results

### The first four steps are discussed in the Bag of Words to XGBoost model notebook.

In [21]:
# Import pandas
import pandas as pd

# Read news sentiment data
news_sentiment_data = pd.read_csv('news_headline_sentiments.csv')
news_sentiment_data.head()

# Store sentiment_class in y
y = news_sentiment_data.sentiment_class

# Store news headlines in X
X = news_sentiment_data.news_headline

# Convert each value in X to a string if it is not one already
X = [x if isinstance(x, str) else str(x) for x in X]

test_ratio = 0.2
train_ratio = 1.0 - test_ratio

num_train = int(train_ratio * len(X))

# X_train and y_train are the training dataset. X_test and y_test are the testing dataset.
X_train = X[:num_train]
y_train = y[:num_train]
X_test = X[num_train:]
y_test = y[num_train:]
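The slicing above keeps the rows in their original order: the first 80% go to training and the last 20% to testing. A comparable split can be sketched with scikit-learn's `train_test_split` and `shuffle=False`. The ten toy headlines and labels below are hypothetical stand-ins for the real data.

```python
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for the headlines and their sentiment classes
X = [f"headline {i}" for i in range(10)]
y = [i % 2 for i in range(10)]

# shuffle=False preserves row order: first 80% for training, last 20% for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False
)

print(len(X_train), len(X_test))
print(X_test)
```

Leaving `shuffle` at its default of `True` would instead draw the test rows at random, which gives a different split from the manual slicing.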

### The fifth step converts the train and test datasets into vectors using the TF-IDF method. This step is discussed in the TF-IDF model.

In [22]:
# Import TfidfVectorizer from sklearn
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize TfidfVectorizer and required arguments to process the data
tfidf_vectorizer = TfidfVectorizer(smooth_idf=False, use_idf=True, stop_words='english', lowercase=True)

# Fit and transform the model on the train dataset
X_new_train = tfidf_vectorizer.fit_transform(X_train)

# Transform the test dataset
X_new_test = tfidf_vectorizer.transform(X_test)
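Note the difference between `fit_transform` on the train data and `transform` on the test data: the vocabulary and IDF weights are learned from the training headlines only and then reused. The small sketch below illustrates this with made-up headlines (they are not taken from `news_headline_sentiments.csv`).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical headlines, made up for illustration
train_headlines = ['stocks rally on strong earnings', 'oil prices fall sharply']
test_headlines = ['stocks fall as inflation surges']

vectorizer = TfidfVectorizer(smooth_idf=False, stop_words='english')

# fit_transform learns the vocabulary and IDF weights from the train data
train_vectors = vectorizer.fit_transform(train_headlines)

# transform reuses them unchanged, so both matrices share the same columns
test_vectors = vectorizer.transform(test_headlines)

print(sorted(vectorizer.vocabulary_))
print(train_vectors.shape, test_vectors.shape)

# Words seen only in the test data have no column and are ignored
print('inflation' in vectorizer.vocabulary_)
```

Calling `fit_transform` on the test data as well would build a different vocabulary, so the columns would no longer line up with what the model was trained on.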

### The next three steps are similar to those in the Bag of Words to XGBoost model.

In [23]:
# Import XGBClassifier from xgboost
from xgboost import XGBClassifier

# Instantiate XGBClassifier
xg = XGBClassifier(max_depth = 6, n_estimators = 100)

# Fit the model on train dataset
xg_model = xg.fit(X_new_train,y_train)

# Predict sentiment class on the test dataset
prediction = xg_model.predict(X_new_test)

# Print the model accuracy
from sklearn.metrics import accuracy_score 
print(accuracy_score(y_test,prediction))

0.7864180339549152
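Accuracy alone does not show which classes the model confuses. One possible way to analyse the results further is a confusion matrix and a per-class report from `sklearn.metrics`. The labels below are hypothetical placeholders; in the notebook, `y_test` and `prediction` would take their place.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical true and predicted labels, for illustration only
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Share of predictions that match the true label
print(accuracy_score(y_true, y_pred))

# Rows are actual classes, columns are predicted classes
print(confusion_matrix(y_true, y_pred))

# Precision, recall and F1 score for each class
print(classification_report(y_true, y_pred))
```

Off-diagonal entries of the confusion matrix count the misclassified headlines, which helps to see whether errors are concentrated in one sentiment class.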
