# <font color='SEAGREEN'>Day 2</font>
# <font color='MEDIUMSEAGREEN'>Exploring the Data</font>

We can’t communicate with computers the same way we communicate with humans.

- We need to do some extra work to describe a text to a computer.
- We have to pick and choose what about that text we want to tell our computers.

The measurable properties that we use to describe our examples/samples in our dataset are called ***features***.

Today we will learn how we can extract these features.

## Text Representation
The most common way to model documents is to transform them into sparse numeric vectors. This representation is called “Bag of Words”. 

Today we will learn two methods:
1. Vector space model
2. TF-IDF

### 1. Vector Space Model

In the vector space model, we are given a set of documents D. Each document is a set of words. The goal is to convert these textual documents to feature vectors. We can represent document $i$ with vector $d_i$.

$$d_i = (w_{1,i}, w_{2,i}, ... , w_{N,i}),$$ 

where N is the number of words used for vectorization and $w_{j,i}$ is the weight for word j that occurs in document i.

Design choices for the weights:

1. We can set the weight to 1 when word j exists in document i and 0 when it does not.
2. We can also set the weight to the number of times word j is observed in document i.

**Example 1:** Suppose we have 4 documents d1, d2, d3 and d4 in our dataset with the following contents:

- d1: It was the best of times
- d2: It was the worst of times
- d3: It was the age of wisdom
- d4: It was the age of foolishness

(The above sentences are the first few lines of the book “A Tale of Two Cities” by Charles Dickens.)

How many distinct words does the dataset have?

In [None]:
# Answer: 

Create a table whose columns are the distinct words of the documents and whose rows are the documents. (The number of columns should be the number you calculated in the previous question, and the number of rows is 4, since we have four documents.)
Fill out the cells of the table for each of the above design choices.

*Show your work to the instructor.*
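The table-building step above can also be sketched in code. The snippet below is a minimal illustration on a hypothetical two-document corpus (`toy_docs`, deliberately not the documents of Example 1). The names `vocabulary`, `binary_vector`, and `count_vector` are assumptions made for this sketch. It builds one vector per document under each of the two design choices.

```python
from collections import Counter

# Hypothetical toy corpus (not the documents from Example 1)
toy_docs = {
    "t1": "the cat saw the dog",
    "t2": "the dog ran",
}

# Tokenize by lowercasing and splitting on whitespace (a simplification)
tokens = {name: text.lower().split() for name, text in toy_docs.items()}

# The vocabulary is the sorted list of distinct words across all documents
vocabulary = sorted({word for words in tokens.values() for word in words})

def binary_vector(words):
    """Design choice 1: weight is 1 if the word occurs in the document, else 0."""
    present = set(words)
    return [1 if word in present else 0 for word in vocabulary]

def count_vector(words):
    """Design choice 2: weight is the number of occurrences in the document."""
    counts = Counter(words)
    return [counts[word] for word in vocabulary]

print(vocabulary)
for name, words in tokens.items():
    print(name, binary_vector(words), count_vector(words))
```

Each printed row corresponds to one row of the table, and each position in a vector corresponds to one column (one word of `vocabulary`).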

One problem with the above method is that commonly used words (such as “the”, “a”, “an”, “and”), known as *stop words*, will get more weight than other words. So one pre-processing step is to delete these common words from the vector representation.

How many stop words does the dataset in Example 1 have? List them below.

In [None]:
# Answer: 

Create similar tables without the stopwords.

*Show your work to the instructor.*

### 2. TF-IDF
A more generalized approach than the vector space model is to use the **term frequency-inverse document frequency (TF-IDF)** weighting scheme. In the TF-IDF scheme, $w_{j,i}$ is calculated as,

$$w_{j,i} = tf_{j,i} \times idf_j$$

where $tf_{j,i}$ is the frequency of word j in document i and $idf_j$ is the inverse document frequency of word j across all documents,

$$ idf_j = \log_2\frac{|D|}{|\{document \in D : j \in document\}|}$$

which is the logarithm of the total number of documents divided by the number of documents that contain word j. TF-IDF assigns higher weights to words that are less frequent across documents and, at the same time, have higher frequencies within the document in which they are used. As a result, words with high TF-IDF values tend to be representative of the documents they belong to, and stop words, which are common in all documents, are assigned smaller weights.
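As a minimal sketch of the formula above, the function below computes $w_{j,i}$ from a term frequency, the corpus size $|D|$, and the number of documents that contain the word. The function name `tf_idf`, its parameter names, and the numbers in the usage lines are hypothetical and chosen only for illustration.

```python
import math

def tf_idf(tf, n_docs, n_docs_with_word):
    """Return tf * log2(|D| / number of documents containing the word)."""
    if n_docs_with_word <= 0:
        raise ValueError("the word must occur in at least one document")
    idf = math.log2(n_docs / n_docs_with_word)
    return tf * idf

# Hypothetical numbers: a word seen 3 times in a document, present in 2 of 8 documents
rare = tf_idf(tf=3, n_docs=8, n_docs_with_word=2)

# A word seen 5 times but present in all 8 documents gets idf = log2(1) = 0
everywhere = tf_idf(tf=5, n_docs=8, n_docs_with_word=8)

print(rare, everywhere)
```

A word that appears in every document gets weight 0 no matter how often it occurs, which is how this scheme down-weights stop words.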

**Example 2:** Consider the words “apple” and “orange”, which appear 10 and 20 times, respectively, in document d1. Let |D| = 20 and assume the word “apple” only appears in document d1 and the word “orange” appears in all 20 documents. Calculate the TF-IDF values for “apple” and “orange” in document d1.

In [None]:
# tf-idf("apple", d1):
# tf-idf("orange", d1):

**Exercise:**

Consider the count table you created in *Example 1* (without stopwords). This table contains the tf values.
- Calculate the idf values for each word.
- The TF-IDF values can be computed by multiplying tf values with the idf values (The result should be a table).

*Show your work to the instructor.*

After vectorization, documents are converted to vectors, and common data mining algorithms can be applied.

## Dataset Pre-processing
Now let's go back to our dataset and create the tf-idf values for our documents.

First we need a list of stopwords. Don't worry, you don't need to think about different possible stopwords. NLTK (the *Natural Language Toolkit*) is a Python library that ships with stopword lists for many languages (the exact number depends on the NLTK version).

To check the list of stopwords you can run the following commands.

In [None]:
import nltk
from nltk.corpus import stopwords

# Download the stopword lists (only needed the first time)
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
stop_words

You can even modify the list by adding words of your choice to the ``stop_words`` set.
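As a small sketch of how a set can be extended, the snippet below uses a hand-written stand-in set named `custom_stop_words` (a hypothetical name, used so that the real `stop_words` set is left untouched). The added words are arbitrary examples.

```python
# Hand-written stand-in for the NLTK list, so the sketch runs without downloads
custom_stop_words = {"the", "and", "of"}

# Add a single word, then several words at once
custom_stop_words.add("will")
custom_stop_words.update(["six", "more"])

print(sorted(custom_stop_words))
```

The same `add` and `update` calls work on any Python set, including one built from the NLTK list.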

Now suppose we want to remove the stop words from the following text:

    Through the Mayo Clinic and ASU Alliance for Health Care Faculty Summer Residency Program, six professors from the College of Health Solutions and Ira A. Fulton Schools of Engineering will spend six weeks working side by side with Mayo Clinic researchers at a Mayo Clinic site in either Rochester, Minnesota, or locally in Phoenix or Scottsdale. The teams will collaborate on research that seeks to have a direct impact on patient outcomes and experiences. This year’s cohort is tackling questions relating to Alzheimer’s disease, Type 1 diabetes, liver disease and more.

First, we need to convert the text to a list of words. nltk helps us do so by tokenizing our sentence into words.

In [None]:
from nltk.tokenize import RegexpTokenizer

sentence = "Through the Mayo Clinic and ASU Alliance for Health Care Faculty Summer Residency Program, six professors from the College of Health Solutions and Ira A. Fulton Schools of Engineering will spend six weeks working side by side with Mayo Clinic researchers at a Mayo Clinic site in either Rochester, Minnesota, or locally in Phoenix or Scottsdale. The teams will collaborate on research that seeks to have a direct impact on patient outcomes and experiences. This year’s cohort is tackling questions relating to Alzheimer’s disease, Type 1 diabetes, liver disease and more."
sentence = sentence.lower()
tokenizer = RegexpTokenizer(r'\w+')
words = tokenizer.tokenize(sentence)
words

Now it is time to remove the stopwords from the tokenized words.

In [None]:
words_wo_stopwords = [] 
  
for w in words: 
    if w not in stop_words: 
        words_wo_stopwords.append(w)
print(words_wo_stopwords)

Let's do the same for our data.

RECALL: you saved your data in the ``data.csv`` file.

Load your data:

In [None]:
# Your code

Concatenate all the texts from all the articles together (because we want to find all the words used in our dataset). Use print to print out your result.

In [None]:
# Your Code

Tokenize the data and remove the stopwords. Don't forget to lowercase all the words. Delete the duplicates from the list (search to see how you can do this). Print out the number of words before deleting the stopwords, before deleting the duplicates, and at the end.

In [None]:
# Your code

If you want to create the tf table, what would be the number of columns and the number of the rows?

In [None]:
# Your answer: 

Create the tf table.

- Create a dataframe filled with zeros.
- Count the words in each article and fill out the created dataframe.

In [None]:
# Your code

Find the idf values for each word. To use logarithms in Python, import the ``math`` library. The second argument is the base of the logarithm:

``math.log(a, base)``
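As a small illustration of the call above, the snippet below evaluates a base-2 logarithm in two ways: with an explicit base passed to `math.log`, and with `math.log2`, a base-2 shortcut in the same module. The input value 8 is an arbitrary choice for this sketch.

```python
import math

value = 8  # arbitrary example input

# Logarithm with an explicit base (second argument)
log_with_base = math.log(value, 2)

# Dedicated base-2 logarithm, matching the log_2 in the idf formula
log_base_two = math.log2(value)

print(log_with_base, log_base_two)
```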

In [None]:
# Your code

Calculate the TF-IDF values.

In [None]:
# Your code

Export your results to ``X.csv``

In [None]:
# Your Code

## Research
Is there any way in Python to improve the collection of words?

For example, the two words "<font color='INDIGO'>dogs</font>" and "<font color='INDIGO'>dog</font>" or "<font color='ORCHID'>walk</font>" and "<font color='ORCHID'>walking</font>" are not that different, but the above method counts them as two separate words.

HINT: The process of reducing the morphological variants of a word to a common root/base form is called *Stemming*. *Lemmatization* is the process of grouping together the different inflected forms of a word so they can be analysed as a single item.

Write down the result of your research in the block below:

In [None]:
# Your answer:

**Advanced Exercise:** We can calculate the tf-idf values with scikit-learn. Find out how we can do that and implement it.

In [None]:
# Your code

In [None]:
print("Nice work today!")

References:

- Zafarani, Reza, Mohammad Ali Abbasi, and Huan Liu. Social Media Mining: An Introduction. Cambridge University Press, 2014.