# Part 1 : Natural Language Processing

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
from sklearn.model_selection import train_test_split

%matplotlib inline

## 1.1 Data

* In this lecture, we use the SMS Spam Collection Data Set from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection). The collection includes:
    * A collection of 425 SMS spam messages manually extracted from the Grumbletext website.
    * A subset of 3,375 randomly chosen ham messages from the NUS SMS Corpus (NSC), a dataset of about 10,000 legitimate messages collected for research at the Department of Computer Science at the National University of Singapore.

In [None]:
#df_sms = pd.read_csv('./SMS_Spam.tsv', sep='\t')

import io
from google.colab import files
uploaded = files.upload()

df_sms = pd.read_csv(io.StringIO(uploaded['SMS_Spam.tsv'].decode('utf-8')), sep='\t')

In [None]:
df_sms.head()

## 1.2 Exploratory Data Analysis
* First, how many messages does the data set contain?

In [None]:
len(df_sms)

* Next, how many spam and ham messages are there?

In [None]:
df_sms['label'].value_counts()
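
* Raw counts can be hard to compare at a glance. As a minimal sketch on made-up labels (not the real data), `value_counts(normalize=True)` returns the share of each class instead:

```python
import pandas as pd

# Toy labels standing in for df_sms['label'] (made-up data)
labels = pd.Series(['ham', 'ham', 'ham', 'spam'], name='label')

# normalize=True returns proportions instead of raw counts
class_share = labels.value_counts(normalize=True)
print(class_share)
```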

* Now, let's compute the length of each message and store it in a new column.

In [None]:
df_sms['length'] = df_sms['message'].apply(len)
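
* An alternative to `apply(len)` is the vectorized `.str.len()` accessor. The sketch below uses made-up messages. One practical difference is that `.str.len()` returns NaN for missing values, whereas `apply(len)` raises a `TypeError` on them.

```python
import pandas as pd

# Toy messages standing in for df_sms['message'] (made-up data)
messages = pd.Series(['Hi', 'Free entry now', ''], name='message')

# .str.len() counts the characters of every string in the Series
lengths = messages.str.len()
print(lengths.tolist())
```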

In [None]:
df_sms.head()

* How are the lengths of messages distributed?

In [None]:
sns.displot(df_sms['length'])
plt.show()

* Do the length distributions of spam and ham messages differ?

In [None]:
df_spam = df_sms[df_sms['label']=='spam'].reset_index(drop=True)
df_ham = df_sms[df_sms['label']=='ham'].reset_index(drop=True)

In [None]:
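plt.figure(figsize=(15,10))

# histplot draws on the current axes, so both distributions share one figure
# (displot is figure-level: it ignores plt.figure and opens a new figure per call)
sns.histplot(df_spam['length'], color='red', label='spam')
sns.histplot(df_ham['length'], color='blue', label='ham')
plt.legend()
plt.show()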
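
* Besides the plot, a numeric summary per class can make the difference explicit. A minimal sketch on a made-up frame with the same `label` and `length` columns:

```python
import pandas as pd

# Toy frame with the same column names as df_sms (made-up values)
toy = pd.DataFrame({
    'label': ['ham', 'ham', 'spam', 'spam'],
    'length': [20, 40, 150, 160],
})

# Average message length for each label
length_by_label = toy.groupby('label')['length'].mean()
print(length_by_label)
```

* Replacing `.mean()` with `.describe()` would give the count, spread, and quartiles per label as well.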

## 1.3 Text preprocessing
* To analyze text, we need to split each message into individual words.
* Let's remove punctuation first.
    * Python's built-in **string** module provides a quick and convenient way to do this.

In [None]:
import string

string.punctuation

* Check each character and keep only those that are not punctuation.

In [None]:
sample = "Hello! This is SK HLP: Data Literacy lecture."

In [None]:
sample_nopunc = []
for char in sample:
    if char not in string.punctuation:
        sample_nopunc.append(char)

In [None]:
sample_nopunc = "".join(sample_nopunc)
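
* The same result can be obtained without an explicit loop. As a sketch, `str.maketrans` builds a deletion table and `str.translate` applies it in one pass:

```python
import string

sample = "Hello! This is SK HLP: Data Literacy lecture."

# The third argument of str.maketrans lists the characters to delete
table = str.maketrans('', '', string.punctuation)
sample_nopunc = sample.translate(table)
print(sample_nopunc)
```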

* The next step is to remove stopwords. NLTK is a standard Python library for text processing (https://www.nltk.org/).
* The NLTK library provides a list of stopwords.

In [None]:
import nltk
from nltk.corpus import stopwords

* We can specify the language of the stopword list. The list has to be downloaded once before use.

In [None]:
nltk.download('stopwords')

In [None]:
stopwords.words('english')

* Split the message and remove stopwords according to the list.

In [None]:
sample_nopunc

In [None]:
sample_nopunc.split()

In [None]:
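# build the stopword set once; a set lookup is faster than reloading the list for every word
stop_set = set(stopwords.words('english'))

remove_stopwords = []
for word in sample_nopunc.split():
    if word.lower() not in stop_set:
        remove_stopwords.append(word)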

In [None]:
remove_stopwords
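
* The filtering step can also be written as a list comprehension. The sketch below uses a small hypothetical stopword set so that it runs without downloading anything; in the notebook the set would come from `stopwords.words('english')`.

```python
# Hypothetical mini stopword set (an assumption for illustration only);
# the notebook would use set(stopwords.words('english')) from NLTK
stop_set = {'this', 'is', 'a', 'the', 'of'}

sample_nopunc = "Hello This is SK HLP Data Literacy lecture"

# Keep a word only if its lowercase form is not a stopword
kept = [word for word in sample_nopunc.split() if word.lower() not in stop_set]
print(kept)
```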

* Wrapping these steps in a function makes them easy to reuse later.

In [None]:
def preprocessing(text):
    """Return a list of tokens with punctuation, stopwords, and short words removed."""

    # remove punctuation
    nopunc = "".join(char for char in text if char not in string.punctuation)

    # remove stopwords (build the set once per call instead of once per word)
    stop_set = set(stopwords.words('english'))
    remove_stop = [word for word in nopunc.split() if word.lower() not in stop_set]

    # remove words shorter than three characters
    tokens = [word for word in remove_stop if len(word) >= 3]

    return tokens

In [None]:
sample

In [None]:
preprocessing(sample)

* You can apply the preprocessing function to the whole DataFrame.

In [None]:
df_sms.head()

In [None]:
df_sms['message'].apply(preprocessing)

## 1.4 Frequency Analysis

In [None]:
clean_spam = df_spam['message'].apply(preprocessing)
clean_ham = df_ham['message'].apply(preprocessing)

* First, let's merge the token lists of each DataFrame into a single list.

In [None]:
whole_spam = []
for line in clean_spam.tolist():
    whole_spam += line
    
whole_ham = []
for line in clean_ham.tolist():
    whole_ham += line
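
* Flattening a list of lists is also what `itertools.chain.from_iterable` does. A minimal sketch on made-up token lists:

```python
from itertools import chain

# Toy token lists standing in for clean_spam.tolist() (made-up data)
clean = [['free', 'entry'], ['win', 'prize', 'now'], []]

# chain.from_iterable walks through every inner list in order
whole = list(chain.from_iterable(clean))
print(whole)
```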

* The **Text** class in the **NLTK** library provides some useful methods for text analysis.

In [None]:
from nltk import Text

ham_text = Text(whole_ham)
spam_text = Text(whole_spam)

* The **vocab** method of the **Text** class returns a frequency distribution that counts how often each token occurs.

In [None]:
freqDist_ham = ham_text.vocab()

In [None]:
freqDist_ham.most_common(10)

* How about spam messages?

In [None]:
freqDist_spam = spam_text.vocab()
freqDist_spam.most_common(10)
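
* NLTK's frequency distribution is built on Python's `collections.Counter`, so the same counting can be sketched with the standard library alone. The tokens below are made up for illustration.

```python
from collections import Counter

# Toy token list standing in for whole_spam (made-up data)
tokens = ['call', 'free', 'call', 'txt', 'call', 'free']

# Counter maps each token to its number of occurrences
freq = Counter(tokens)
top2 = freq.most_common(2)
print(top2)
```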

* You can plot the frequencies of the most common tokens with the **plot** method.

In [None]:
plt.figure(figsize=(10,8))

ham_text.plot(30)
plt.show()

* We can also use the **wordcloud** package for visualization.
* You can install the package with `conda install -c conda-forge wordcloud` or `pip install wordcloud`.

In [None]:
from wordcloud import WordCloud

plt.figure(figsize=(15,10))

wc_ham = WordCloud(width=1000, height=600, background_color="black", random_state=0)
plt.imshow(wc_ham.generate_from_frequencies(freqDist_ham))
plt.axis("off")
plt.show()

# Part 2 : Recommendation System

* A recommendation system is a kind of information filtering system that seeks to predict the "rating" or "preference" a user would give to an item. Such systems are primarily used in commercial applications (https://en.wikipedia.org/wiki/Recommender_system).
* There are two common types of recommender systems:
    * **Content-Based Filtering** focuses on the attributes of the items and recommends items based on the similarity between them.
    
    * **Collaborative Filtering** produces recommendations based on users' attitudes (activity) toward items.


* Movie recommendation is a common first step in learning about recommendation systems.
* The MovieLens dataset is a well-known dataset for learning to build recommendation systems.
    * https://grouplens.org/datasets/movielens/
    * https://kaggle.com/grouplens/movielens-20m-dataset

In [None]:
#ratings = pd.read_csv('./ratings.csv')

uploaded = files.upload()
ratings = pd.read_csv(io.StringIO(uploaded['ratings.csv'].decode('utf-8')))

In [None]:
ratings.head()

In [None]:
#movies = pd.read_csv('./movies.csv')

uploaded = files.upload()
movies = pd.read_csv(io.StringIO(uploaded['movies.csv'].decode('utf-8')))

In [None]:
movies.head()

* Let's first merge those two dataframes.

In [None]:
df_movies = pd.merge(ratings, movies, on='movieId')
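
* When merging, it can help to state the expected key relationship. As a sketch on made-up frames, `validate='many_to_one'` makes pandas raise an error if a `movieId` appears more than once in the right-hand table:

```python
import pandas as pd

# Toy frames with the same key column as the notebook (made-up values)
ratings = pd.DataFrame({
    'userId': [1, 1, 2],
    'movieId': [10, 20, 10],
    'rating': [4.0, 5.0, 3.0],
})
movies = pd.DataFrame({
    'movieId': [10, 20, 30],
    'title': ['A', 'B', 'C'],
})

# Inner join: only movies that have at least one rating are kept
merged = pd.merge(ratings, movies, on='movieId', how='inner', validate='many_to_one')
print(merged)
```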

In [None]:
df_movies.head()

* Which movie has the highest user ratings on average?

In [None]:
ratings_sort = df_movies.groupby('title')['rating'].mean().sort_values(ascending=False)

In [None]:
ratings_sort

* Which movies received the most ratings from users?

In [None]:
counting_sort = df_movies.groupby('title')['rating'].count().sort_values(ascending=False)
counting_sort

* Let's combine those two results.

In [None]:
movie_ratings = pd.DataFrame(df_movies.groupby('title')['rating'].mean())
# assign the count Series directly; it is aligned on the 'title' index
movie_ratings['numbers'] = df_movies.groupby('title')['rating'].count()
movie_ratings.head()
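
* The mean and the count can also be computed in a single `groupby` pass with `agg`. A minimal sketch on a made-up frame:

```python
import pandas as pd

# Toy frame with the same column names as df_movies (made-up values)
toy = pd.DataFrame({
    'title': ['A', 'A', 'B'],
    'rating': [4.0, 2.0, 5.0],
})

# One column per aggregation: average rating and number of ratings
summary = toy.groupby('title')['rating'].agg(['mean', 'count'])
print(summary)
```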

* Now, reshape the DataFrame into a user-by-movie matrix using pivot_table.

In [None]:
user_movie_matrix = df_movies.pivot_table(index='userId', columns='title', values='rating')
user_movie_matrix

* Replace the NaN values with 0.

In [None]:
user_movie_matrix.fillna(0, inplace=True)
user_movie_matrix.head()
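
* Filling with 0 treats "not rated" the same as a rating of 0, which can change similarity scores. The sketch below, on two made-up rating columns, compares the correlation pandas computes when it skips missing pairs with the correlation after zero filling. It is only an illustration of the effect, not a statement about which choice suits this dataset.

```python
import numpy as np
import pandas as pd

# Toy ratings of five users for two movies (made-up values, NaN = not rated)
movie_a = pd.Series([5, 4, np.nan, np.nan, 1])
movie_b = pd.Series([4, 5, np.nan, 2, np.nan])

# Series.corr uses only the rows where both values are present
corr_skip_nan = movie_a.corr(movie_b)

# After zero filling, every row takes part in the calculation
corr_zero_fill = movie_a.fillna(0).corr(movie_b.fillna(0))

print(corr_skip_nan, corr_zero_fill)
```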

* Let's take two examples of movies.

In [None]:
Matrix = user_movie_matrix['Matrix, The (1999)']
Matrix.head(10)

In [None]:
Terminator = user_movie_matrix['Terminator 2: Judgment Day (1991)']
Terminator.head(10)

* How similar are those two movies?

In [None]:
Matrix.corr(Terminator)
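
* By default, `Series.corr` computes the Pearson correlation coefficient. As a sketch on made-up ratings, the same value can be reproduced by hand from the deviations around each mean:

```python
import numpy as np
import pandas as pd

# Toy rating columns for two movies (made-up values)
x = pd.Series([5.0, 4.0, 1.0, 2.0, 5.0])
y = pd.Series([4.0, 5.0, 2.0, 1.0, 4.0])

# Deviations from the mean
dx = x - x.mean()
dy = y - y.mean()

# Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
pearson_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())
pearson_pandas = x.corr(y)

print(pearson_manual, pearson_pandas)
```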

* Which movie is the most similar to "Matrix, The (1999)"?

In [None]:
Matrix_corr = pd.DataFrame(user_movie_matrix.corrwith(Matrix), columns=['correl'])
Matrix_corr
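
* To read off the most similar movies, the correlations can be sorted after dropping the target movie itself. A minimal sketch on a made-up user-by-movie matrix with hypothetical titles 'A', 'B', and 'C':

```python
import pandas as pd

# Toy user-by-movie matrix (rows = users, columns = hypothetical titles)
toy_matrix = pd.DataFrame({
    'A': [5, 4, 1, 2, 5],
    'B': [4, 5, 2, 1, 4],
    'C': [1, 2, 5, 4, 1],
})

target = toy_matrix['A']

# Correlate every column with the target, drop the target, sort descending
similar = toy_matrix.corrwith(target).drop('A').sort_values(ascending=False)
print(similar)
```

* On the real data, one might also require a minimum number of ratings per movie (the `numbers` column of `movie_ratings`) before trusting a high correlation.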