# Assignment 1 - Text Preprocessing and Similarity
This assignment focuses on text preprocessing and on calculating text similarity with TF-IDF. Please put the data file `kickstarter_desc_sample.csv` in the same directory as this notebook.

The code below loads the data and selects a random sample for the subsequent tasks, so each student works on a different random sample. It then displays what the data in your sample looks like.

The code in this notebook primarily uses `pandas` dataframes. You can use `df['column'].apply(function)` to apply any preprocessing function directly to a dataframe column.
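
As a minimal sketch of the `apply` pattern, the snippet below runs a simple function over a text column. The dataframe `toy`, the column `description_lower`, and the function `to_lowercase` are hypothetical names used only for illustration; they are not part of the assignment data.

```python
import pandas as pd

# Hypothetical toy dataframe standing in for the real sample
toy = pd.DataFrame(
    {"description_str": ["A NEW Board Game!", "Help us record our album"]}
)

def to_lowercase(text):
    """Example preprocessing step: lowercase a single description."""
    return text.lower()

# apply() calls the function once per value and returns a new Series
toy["description_lower"] = toy["description_str"].apply(to_lowercase)
print(toy)
```

Any function that takes one string and returns one string can be plugged in the same way.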

The data was collected from a crowdfunding website, Kickstarter (www.kickstarter.com), and consists of entrepreneurial fundraising campaigns. The sample contains around five thousand project descriptions. Each row has a `project_id`, the unique identifier of the project, and a textual project description (`description_str`).

In [None]:
import pandas as pd
import numpy as np
descriptions = pd.read_csv('./kickstarter_desc_sample.csv')
descriptions = descriptions.sample(1000)
import platform
platform.uname()

In [None]:
pd.set_option('display.max_colwidth', 200)
descriptions.head()

## Task 1 Text Preprocessing
In Task 1 you will preprocess the text. The text contains a lot of noise and unstructured terms, so you need to clean it before performing the subsequent analysis.

In this task, you will implement (1) a `tokenizer`, (2) `stop words removal`, and (3) a `lemmatizer` to preprocess the texts in the column `description_str`. The cleaned text should end up in a separate column of the `descriptions` dataframe named `description_clean`.

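First, make sure you have downloaded the NLTK resources used below, for example with `nltk.download('stopwords')`, `nltk.download('wordnet')`, and `nltk.download('averaged_perceptron_tagger')`. The exact resource names can differ between NLTK versions.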

### `Task 1.1` Tokenizer
Implement a tokenizer that splits the text into tokens and removes special characters.
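
To make the effect of this step concrete, here is a rough standard-library sketch of the same idea: lowercase the text, replace special characters with spaces, and split on whitespace. The function name `simple_tokenize` and the sample sentence are made up for illustration, and the sketch does not use NLTK.

```python
import re

def simple_tokenize(text):
    """Lowercase, replace non-alphanumeric characters with spaces, then split."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9]", " ", text)
    return text.split()

tokens = simple_tokenize("Back our $25 card-game: it's fun!")
print(tokens)
# ['back', 'our', '25', 'card', 'game', 'it', 's', 'fun']
```

Note that punctuation inside words (the hyphen and the apostrophe) also becomes a token boundary.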

In [None]:
import re
from nltk.tokenize import RegexpTokenizer

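def tokenizer(description):
    # Lowercase, then replace every non-alphanumeric character with a space
    description = description.lower()
    description = re.sub(r'[^a-zA-Z0-9]', ' ', description)
    # Raw string avoids invalid-escape warnings; after the substitution above,
    # only the \w+ alternative can still match
    regexp_tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')
    words = regexp_tokenizer.tokenize(description)
    # Return the tokens as a single space-separated string
    tokenized_words = ' '.join(words)
    return tokenized_words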

In [None]:
descriptions['description_clean'] = descriptions['description_str'].apply(tokenizer)
descriptions.head()

### `Task 1.2` Stopwords
Implement stop word removal to remove English stop words.
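
The idea can be sketched with a tiny hand-written stop list. `TOY_STOP_WORDS` and `drop_stop_words` are hypothetical names, and the six-word list is only a stand-in; in the assignment the full English list from `nltk.corpus.stopwords` would take its place.

```python
# Hypothetical miniature stop list, NOT the NLTK list
TOY_STOP_WORDS = {"the", "a", "is", "of", "and", "to"}

def drop_stop_words(text, stop_words=TOY_STOP_WORDS):
    """Keep only the whitespace-separated tokens that are not stop words."""
    kept = [word for word in text.split() if word not in stop_words]
    return " ".join(kept)

cleaned = drop_stop_words("the goal of the campaign is to fund a documentary")
print(cleaned)
# goal campaign fund documentary
```

Using a `set` for the stop list keeps each membership check cheap, which matters when the function is applied to every row.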

In [None]:
from nltk.corpus import stopwords

def remove_stopwords(words):
    #Your code to implement stopwords removal process in column description_clean
    
    

In [None]:
descriptions['description_clean'] = descriptions['description_clean'].apply(remove_stopwords)
descriptions.head()

### `Task 1.3` Lemmatizer
Implement a lemmatizer for the tokens in the text, using the WordNet lemmatizer with POS tags.
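
POS tags matter because the same surface form can have different lemmas depending on its part of speech. The toy sketch below illustrates this with a hand-written lookup table; `TOY_LEMMAS` and `toy_lemmatize` are hypothetical stand-ins for WordNet and are not how the NLTK lemmatizer is implemented.

```python
# Hand-written toy table standing in for WordNet: (token, POS) -> lemma
TOY_LEMMAS = {
    ("meeting", "n"): "meeting",  # noun: "a meeting"
    ("meeting", "v"): "meet",     # verb: "we are meeting"
    ("games", "n"): "game",
}

def toy_lemmatize(token, pos):
    """Look up the lemma for a (token, POS) pair; fall back to the token itself."""
    return TOY_LEMMAS.get((token, pos), token)

print(toy_lemmatize("meeting", "n"))  # meeting
print(toy_lemmatize("meeting", "v"))  # meet
```

Without a POS tag, a lemmatizer has to assume one part of speech for every token, which is why the helper below falls back to a noun tag.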

In [None]:
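import nltk
from nltk.corpus import wordnet

def get_part_of_speech_tags(token):
    # Map the first letter of the Penn Treebank tag to a WordNet POS constant
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    # pos_tag returns [(token, tag)]; keep only the first letter of the tag
    tag = nltk.pos_tag([token])[0][1][0].upper()

    # Fall back to noun when the tag has no WordNet equivalent
    return tag_dict.get(tag, wordnet.NOUN)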

In [None]:
#[valid]
from nltk.stem import WordNetLemmatizer

def postag_lemmentization(words):
    #Your code to implement WordNet lemmatizer with POS tags in column description_clean
    

In [None]:
descriptions['description_clean'] = descriptions['description_clean'].apply(postag_lemmentization)
descriptions.head()

The following code does some further cleaning by removing all numbers from the text.

In [None]:
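# regex=True is required in recent pandas versions, where str.replace is literal by default
descriptions['description_clean'] = descriptions['description_clean'].str.replace(r'\d+', '', regex=True)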
descriptions.head()

## Task 2 Use TF-IDF to represent text and calculate similarity
In the second task, you will calculate cosine similarity from the TF-IDF representation of the documents.

In the following code, the `descriptions` dataframe will be randomly split into one main sample and one small test sample (3 project descriptions). 

You will then:

(1) build a TF-IDF vector representation from the `description_clean` column of the `desc_main` sample and show the shape of the TF-IDF matrix;

(2) for each of the three project descriptions in `desc_test`, find the most similar project in the `desc_main` sample.
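
To build intuition for the two steps above, here is a small standard-library sketch that computes TF-IDF weights and cosine similarity by hand. It uses the textbook weighting `tf * log(N / df)`, which is an assumption made for simplicity; scikit-learn applies a smoothed IDF and normalizes the vectors by default, so its numbers will differ. The names `tfidf_vectors`, `cosine`, and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: c * math.log(n_docs / df[t]) for t, c in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

docs = ["board game card", "card game fantasy", "documentary film music"]
vecs = tfidf_vectors(docs)
sim_01 = cosine(vecs[0], vecs[1])  # documents share "card" and "game"
sim_02 = cosine(vecs[0], vecs[2])  # documents share no terms
print(round(sim_01, 3), sim_02)
```

Documents that share rare terms score higher than documents that share only common terms, and documents with no terms in common score zero.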

In [None]:
desc_test = descriptions.sample(3)
desc_main = descriptions.drop(desc_test.index)

### `Task 2.1` TF-IDF
Build a TF-IDF vector representation for the project descriptions in `desc_main` and show the `shape` of the TF-IDF matrix. When building the TF-IDF matrix, keep at most the top `500` terms. Use `CountVectorizer` and `TfidfTransformer` to build the matrices. Consider `9.Chatbot.ipynb` as a reference.

In [None]:
#Your code to build TF-IDF vectorizer from desc_main['description_clean']



### `Task 2.2` Search Similar Project

Now look up the `project_id`s and descriptions in `desc_test`. The code below shows the three projects in that dataframe.

In [None]:
desc_test.head()

For these three projects in `desc_test`, find the most similar project for each of them in `desc_main`. Print the `project_id` of the most similar project description and the corresponding `cosine similarity`. 
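
The search itself amounts to computing the cosine similarity between one query vector and every row of the main matrix, then taking the row with the highest score. Below is a minimal NumPy sketch with dense toy vectors; `most_similar_row`, `main_matrix`, `main_ids`, and `query` are hypothetical names and values. A real TF-IDF matrix from scikit-learn is sparse, in which case `sklearn.metrics.pairwise.cosine_similarity` may be more convenient.

```python
import numpy as np

def most_similar_row(query_vec, matrix):
    """Return (row index, cosine similarity) of the row closest to query_vec."""
    query_norm = np.linalg.norm(query_vec)
    row_norms = np.linalg.norm(matrix, axis=1)
    denom = query_norm * row_norms
    # Leave the similarity at 0 where a vector is all zeros
    sims = np.divide(matrix @ query_vec, denom,
                     out=np.zeros_like(denom, dtype=float), where=denom > 0)
    best = int(np.argmax(sims))
    return best, float(sims[best])

main_matrix = np.array([[1.0, 0.0, 0.0],
                        [0.0, 1.0, 1.0],
                        [1.0, 1.0, 0.0]])
main_ids = [101, 102, 103]          # hypothetical project_ids, one per row
query = np.array([0.0, 2.0, 2.0])   # hypothetical test-project vector

idx, score = most_similar_row(query, main_matrix)
print(main_ids[idx], round(score, 3))
```

Keeping the list of IDs in the same order as the matrix rows is what lets the row index be translated back into a `project_id`.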

Edit the headings below to replace `project_id` with the three `project_id`s you get and then implement your code.

#### Get the most similar project for `project_id`

Use the following line of code to get the cleaned description for a given `project_id` (replace the example ID with one of your own):

`project_desc = desc_test[desc_test['project_id']==1617375102]['description_clean'].values`

In [None]:
#Your code to get the most similar project for project 1 from the TF-IDF vectorizer


#Print the similarity score and corresponding project_id



#### Get the most similar project for `project_id`

In [None]:
#Your code to get the most similar project for project 2 from the TF-IDF vectorizer 


#Print the similarity score and corresponding project_id



#### Get the most similar project for `project_id`

In [None]:
#Your code to get the most similar project for project 3 from the TF-IDF vectorizer 


#Print the similarity score and corresponding project_id


