# Assignment 1 - Text Preprocessing and Similarity
This assignment will focus on text preprocessing and the calculation of text similarity through TF-IDF. Please put the data file `kickstarter_desc_sample.csv` under the same directory with this notebook. 

The code below will load the data and select a random sample for subsequent tasks. Each student will work on a different random sample. Then it will display how the data looks like in your sample. 

The code in the notebook primarily uses `pandas` dataframe. You can use `df['column'].apply(function)` to apply any preprocessing function directly on pandas dataframe.

The data was collected from the a crowdfunding website, Kickstarter (<a>www.kickstarter.com</a>). It is basically about entrepreneurial fundraising campaigns. The sample contains around five thousand project descriptions. For each row in the data, there is a `project_id`, the unique identifier of the project, and a column of textual project description (`description_str`).

In [1]:
import pandas as pd
import numpy as np
descriptions = pd.read_csv('./kickstarter_desc_sample.csv')
descriptions = descriptions.sample(1000)
import platform
platform.uname()

uname_result(system='Darwin', node='Lusis-MacBook-Pro-M1.local', release='22.5.0', version='Darwin Kernel Version 22.5.0: Mon Apr 24 20:52:24 PDT 2023; root:xnu-8796.121.2~5/RELEASE_ARM64_T6000', machine='arm64')

In [2]:
pd.set_option('display.max_colwidth', 200)
descriptions.head()

Unnamed: 0,project_id,description_str
3594,367456610,Check us out! We are in the news!We have some big-name media outlets who are excited to hear the stories of the entrepreneurs we meet on the tour and are going to provide great coverage over the ...
3324,1179277958,"---- Update Oct 1st ----Most of the components to build ControLeo are in our hands now. We will be going ahead with production regardless of the outcome of this campaign. Place your pledge now, ..."
1722,1098848908,"doljiggu [doljikku] - The Straight-Shooting, no Frills Narrativeannyeong yeoreobun [annyoung yorobun]. Hey all. My name is Peter Liptak. Author of the first edition of As much as a Rat's Tail - Ko..."
405,187763644,"What is MUSIC CITY?MUSIC CITY is a narrative feature length film set in New York City. It's a love note to New York, with drama, laughs, inspiration, and characters drawn from real life. And the m..."
3972,1674439151,"When I first heard about this film project, I automatically had to get involved. I was blown away by the script, the lyrics, the music, the storyboard and concept art that you can imagine how bumm..."


## Task 1 Text Preprocessing
In Task 1 you will preprocess the text. The text has lots of noises and unstructured terms so before performing subsequent analysis, you need to clean them first.

In this task, you will implement (1) `tokenizer` (2) `stop words removal` (3) `lemmatizer` to preprocess the texts in the column `description_str`. Eventually, the cleaned text should be saved as a separate column in the `descriptions` dataframe with the name `description_clean`.

First make sure you download the NLTK corpora.

### `Task 1.1` Tokenizer
Implement tokenizer to tokenize the text and remove special characters in the text

In [3]:
import re
from nltk.tokenize import RegexpTokenizer

def tokenizer(description):
    #tokenize to extract words
    description = description.lower()
    description = re.sub('[^a-zA-Z0-9]', ' ', description)
    tokenizer = RegexpTokenizer('\w+|\$[\d\.]+|\S+')
    words = tokenizer.tokenize(description)
    tokenized_words = ' '.join(words)
    return tokenized_words

In [4]:
descriptions['description_clean'] = descriptions['description_str'].apply(tokenizer)
descriptions.head()

Unnamed: 0,project_id,description_str,description_clean
3594,367456610,Check us out! We are in the news!We have some big-name media outlets who are excited to hear the stories of the entrepreneurs we meet on the tour and are going to provide great coverage over the ...,check us out we are in the news we have some big name media outlets who are excited to hear the stories of the entrepreneurs we meet on the tour and are going to provide great coverage over the su...
3324,1179277958,"---- Update Oct 1st ----Most of the components to build ControLeo are in our hands now. We will be going ahead with production regardless of the outcome of this campaign. Place your pledge now, ...",update oct 1st most of the components to build controleo are in our hands now we will be going ahead with production regardless of the outcome of this campaign place your pledge now and if we don ...
1722,1098848908,"doljiggu [doljikku] - The Straight-Shooting, no Frills Narrativeannyeong yeoreobun [annyoung yorobun]. Hey all. My name is Peter Liptak. Author of the first edition of As much as a Rat's Tail - Ko...",doljiggu doljikku the straight shooting no frills narrativeannyeong yeoreobun annyoung yorobun hey all my name is peter liptak author of the first edition of as much as a rat s tail korean slang i...
405,187763644,"What is MUSIC CITY?MUSIC CITY is a narrative feature length film set in New York City. It's a love note to New York, with drama, laughs, inspiration, and characters drawn from real life. And the m...",what is music city music city is a narrative feature length film set in new york city it s a love note to new york with drama laughs inspiration and characters drawn from real life and the music b...
3972,1674439151,"When I first heard about this film project, I automatically had to get involved. I was blown away by the script, the lyrics, the music, the storyboard and concept art that you can imagine how bumm...",when i first heard about this film project i automatically had to get involved i was blown away by the script the lyrics the music the storyboard and concept art that you can imagine how bummed i ...


### `Task 1.2` Stopwords
Implement stop words removal to remove english stop words

In [5]:
from nltk.corpus import stopwords

def remove_stopwords(words):
    #Your code to implement stopwords removal process in column description_clean
    token_list = words.split(' ')
    stop = set(stopwords.words('english'))
    filtered_words = ' '.join([token for token in token_list if not token in stop])
    return filtered_words

In [6]:
descriptions['description_clean'] = descriptions['description_clean'].apply(remove_stopwords)
descriptions.head()

Unnamed: 0,project_id,description_str,description_clean
3594,367456610,Check us out! We are in the news!We have some big-name media outlets who are excited to hear the stories of the entrepreneurs we meet on the tour and are going to provide great coverage over the ...,check us news big name media outlets excited hear stories entrepreneurs meet tour going provide great coverage summer names coming soon jasen lee deseret news wrote great story us check http www d...
3324,1179277958,"---- Update Oct 1st ----Most of the components to build ControLeo are in our hands now. We will be going ahead with production regardless of the outcome of this campaign. Place your pledge now, ...",update oct 1st components build controleo hands going ahead production regardless outcome campaign place pledge reach target let know get hands controleo pledge know reach thanks idea behind contr...
1722,1098848908,"doljiggu [doljikku] - The Straight-Shooting, no Frills Narrativeannyeong yeoreobun [annyoung yorobun]. Hey all. My name is Peter Liptak. Author of the first edition of As much as a Rat's Tail - Ko...",doljiggu doljikku straight shooting frills narrativeannyeong yeoreobun annyoung yorobun hey name peter liptak author first edition much rat tail korean slang invective euphemism along fabulous co ...
405,187763644,"What is MUSIC CITY?MUSIC CITY is a narrative feature length film set in New York City. It's a love note to New York, with drama, laughs, inspiration, and characters drawn from real life. And the m...",music city music city narrative feature length film set new york city love note new york drama laughs inspiration characters drawn real life music breathtaking music genres focus classical film pa...
3972,1674439151,"When I first heard about this film project, I automatically had to get involved. I was blown away by the script, the lyrics, the music, the storyboard and concept art that you can imagine how bumm...",first heard film project automatically get involved blown away script lyrics music storyboard concept art imagine bummed found film production way reach suddenly pops kickstarter mind glory contac...


### `Task 1.3` Lemmatizer
Implement lemmatizer on the tokens in the text using WordNet lemmatizer with POS tags

In [7]:
from nltk.corpus import wordnet

def get_part_of_speech_tags(token):
    
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    
    tag = nltk.pos_tag([token])[0][1][0].upper()
    
    return tag_dict.get(tag, wordnet.NOUN)

In [8]:
from nltk.stem import WordNetLemmatizer
import nltk

def postag_lemmentization(words):
    #Your code to implement WordNet lemmatizer with POS tags in column description_clean
    token_list = words.split(' ')
    lemmatizer = WordNetLemmatizer()
    lemmatized_words = ' '.join([lemmatizer.lemmatize(token, get_part_of_speech_tags(token)) for token in token_list])
    return lemmatized_words

In [9]:
descriptions['description_clean'] = descriptions['description_clean'].apply(postag_lemmentization)
descriptions.head()

Unnamed: 0,project_id,description_str,description_clean
3594,367456610,Check us out! We are in the news!We have some big-name media outlets who are excited to hear the stories of the entrepreneurs we meet on the tour and are going to provide great coverage over the ...,check u news big name medium outlet excite hear story entrepreneur meet tour go provide great coverage summer name come soon jasen lee deseret news write great story u check http www deseretnews c...
3324,1179277958,"---- Update Oct 1st ----Most of the components to build ControLeo are in our hands now. We will be going ahead with production regardless of the outcome of this campaign. Place your pledge now, ...",update oct 1st component build controleo hand go ahead production regardless outcome campaign place pledge reach target let know get hand controleo pledge know reach thanks idea behind controleoar...
1722,1098848908,"doljiggu [doljikku] - The Straight-Shooting, no Frills Narrativeannyeong yeoreobun [annyoung yorobun]. Hey all. My name is Peter Liptak. Author of the first edition of As much as a Rat's Tail - Ko...",doljiggu doljikku straight shoot frill narrativeannyeong yeoreobun annyoung yorobun hey name peter liptak author first edition much rat tail korean slang invective euphemism along fabulous co auth...
405,187763644,"What is MUSIC CITY?MUSIC CITY is a narrative feature length film set in New York City. It's a love note to New York, with drama, laughs, inspiration, and characters drawn from real life. And the m...",music city music city narrative feature length film set new york city love note new york drama laugh inspiration character drawn real life music breathtaking music genre focus classical film pay h...
3972,1674439151,"When I first heard about this film project, I automatically had to get involved. I was blown away by the script, the lyrics, the music, the storyboard and concept art that you can imagine how bumm...",first heard film project automatically get involve blown away script lyric music storyboard concept art imagine bum found film production way reach suddenly pop kickstarter mind glory contact dire...


The following code will do some further cleaning by removing all the numbers in the texts.

In [10]:
descriptions['description_clean'] = descriptions['description_clean'].str.replace('\d+', '', regex=True)
descriptions.head()

Unnamed: 0,project_id,description_str,description_clean
3594,367456610,Check us out! We are in the news!We have some big-name media outlets who are excited to hear the stories of the entrepreneurs we meet on the tour and are going to provide great coverage over the ...,check u news big name medium outlet excite hear story entrepreneur meet tour go provide great coverage summer name come soon jasen lee deseret news write great story u check http www deseretnews c...
3324,1179277958,"---- Update Oct 1st ----Most of the components to build ControLeo are in our hands now. We will be going ahead with production regardless of the outcome of this campaign. Place your pledge now, ...",update oct st component build controleo hand go ahead production regardless outcome campaign place pledge reach target let know get hand controleo pledge know reach thanks idea behind controleoard...
1722,1098848908,"doljiggu [doljikku] - The Straight-Shooting, no Frills Narrativeannyeong yeoreobun [annyoung yorobun]. Hey all. My name is Peter Liptak. Author of the first edition of As much as a Rat's Tail - Ko...",doljiggu doljikku straight shoot frill narrativeannyeong yeoreobun annyoung yorobun hey name peter liptak author first edition much rat tail korean slang invective euphemism along fabulous co auth...
405,187763644,"What is MUSIC CITY?MUSIC CITY is a narrative feature length film set in New York City. It's a love note to New York, with drama, laughs, inspiration, and characters drawn from real life. And the m...",music city music city narrative feature length film set new york city love note new york drama laugh inspiration character drawn real life music breathtaking music genre focus classical film pay h...
3972,1674439151,"When I first heard about this film project, I automatically had to get involved. I was blown away by the script, the lyrics, the music, the storyboard and concept art that you can imagine how bumm...",first heard film project automatically get involve blown away script lyric music storyboard concept art imagine bum found film production way reach suddenly pop kickstarter mind glory contact dire...


## Task 2 Use TF-IDF to represent text and calculate similarity
In the second task, you will calculate cosine similarity from TF-IDF representation of document. 

In the following code, the `descriptions` dataframe will be randomly split into one main sample and one small test sample (3 project descriptions). 

Then what you are going to do are: 

(1) build TF-IDF vector representation from `desc_main` sample with `description_clean` column and show the shape of TF-IDF matrix; 

(2) for each of the three project descriptions in the `desc_test`, find the most similar project from `desc_main` sample.

In [11]:
desc_test = descriptions.sample(3)
desc_main = descriptions.drop(desc_test.index)

### `Task 2.1` TF-IDF
Build TF-IDF vector representation for the project descriptions in `desc_main` and show the `shape` of TF-IDF matrix. When building TF-IDF matrix, only use up to top `500` terms. Use CountVectorizer and TF-IDF transformer to build the matrices. Consider `9.Chatbot.ipynb` as a reference.

In [12]:
#Your code to build TF-IDF vectorizer from desc_main['description_clean']
from sklearn.feature_extraction.text import TfidfTransformer,CountVectorizer
vectorizer = CountVectorizer(max_features=500)
tfidf =TfidfTransformer()
tf_vector = vectorizer.fit_transform(desc_main['description_clean'])
tfidf_vector = tfidf.fit_transform(tf_vector)
tfidf_vector.shape

(997, 500)

### `Task 2.2` Search Similar Project

Now you need to find the project_ids and descriptions in `desc_test`. The code below shows the three projects in that dataframe.

In [13]:
desc_test.head()

Unnamed: 0,project_id,description_str,description_clean
551,731137767,"The Finex 12"" Cast Iron Skillet is a long overdue redesign of a true American classic. \nWOW! Thanks to you, our backers, we have reached an amazing 800%+ of our goal! The Finex team and I will s...",finex cast iron skillet long overdue redesign true american classic wow thanks backer reach amaze goal finex team strive live level confidence show u product please stay contact go www finexusa ...
4450,2056203251,The Goal\nI'm determined to complete the last three paintings of my Gameboardom series of twelve large-scale paintings -- 'playable' board game parodies that are narratives on our society & cultur...,goal determine complete last three painting gameboardom series twelve large scale painting playable board game parody narrative society culture memorialize entire series meticulously craft page f...
2748,1030934342,"*** READ OUR CURRENT ISSUE: As a celebration of finishing the Kickstarter, we've made our current issue out on December 18th (#32) free to read on the Web until Friday morning. ****** FUNDING UPDA...",read current issue celebration finish kickstarter make current issue december th free read web friday morning funding update thanks everyone reach goal already pass still early bird special left ...


For these three projects in `desc_test`, find the most similar project for each of them in `desc_main`. Print the `project_id` of the most similar project description and the corresponding `cosine similarity`. 

Edit the headings below to replace `project_id` with the three `project_id`s you get and then implement your code.

#### Get the most similar project for `731137767`

Using this line of code to get project description with a given project_id: 

`project_desc = desc_test[desc_test['project_id']==1617375102]['description_clean'].values`

In [17]:
#Your code to get the most similar project for project 1 from the TF-IDF vectorizer
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

project_desc = desc_test[desc_test['project_id']==731137767]['description_clean'].values
desc_tf_vector = vectorizer.transform(project_desc)
desc_tfidf_vector = tfidf.fit_transform(desc_tf_vector)

cos_sim = max(cosine_similarity(desc_tfidf_vector, tfidf_vector)[0])
desc_index = np.argmax(cosine_similarity(desc_tfidf_vector, tfidf_vector)[0])

#Print the similarity score and corresponding project_id
print(cos_sim)
print(desc_index)
print(desc_main.iloc[desc_index]['project_id'])

0.47640171237842427
868
1036906641


#### Get the most similar project for `2056203251`

In [18]:
#Your code to get the most similar project for project 2 from the TF-IDF vectorizer 
project_desc = desc_test[desc_test['project_id']==2056203251]['description_clean'].values
desc_tf_vector = vectorizer.transform(project_desc)
desc_tfidf_vector = tfidf.fit_transform(desc_tf_vector)

cos_sim = max(cosine_similarity(desc_tfidf_vector, tfidf_vector)[0])
desc_index = np.argmax(cosine_similarity(desc_tfidf_vector, tfidf_vector)[0])

#Print the similarity score and corresponding project_id
print(cos_sim)
print(desc_index)
print(desc_main.iloc[desc_index]['project_id'])

0.5470147137389407
751
742576568


#### Get the most similar project for `1030934342`

In [19]:
#Your code to get the most similar project for project 3 from the TF-IDF vectorizer 
project_desc = desc_test[desc_test['project_id']==1030934342]['description_clean'].values
desc_tf_vector = vectorizer.transform(project_desc)
desc_tfidf_vector = tfidf.fit_transform(desc_tf_vector)

cos_sim = max(cosine_similarity(desc_tfidf_vector, tfidf_vector)[0])
desc_index = np.argmax(cosine_similarity(desc_tfidf_vector, tfidf_vector)[0])

#Print the similarity score and corresponding project_id
print(cos_sim)
print(desc_index)
print(desc_main.iloc[desc_index]['project_id'])

0.6372192255437179
416
884166558
