### Feature Engineering

In this notebook, you will use feature engineering techniques to prepare data in a number of different ways.  The data you need to prepare depends on the type of recommendation system you are building.

* **Collaborative Filtering**
    - Dense or Sparse user-item ratings matrix


* **Demographic Recommenders**
    - User information including numeric and categorical features
    
    
* **Content Based Recommenders**
    - Item information including numeric, categorical, and text related features
    
    
* **Utility Based Recommenders**
    - can use any of the features from collaborative filtering, demographic, or content data
    
    
* **Knowledge Based Recommenders**
    - uses any of the above, but includes additional user input to filter results


**Intro:** Read in and take a look at the data below. This data is available from [MovieTweetings](https://github.com/sidooms/MovieTweetings).  Follow the link for additional information about the data.
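
The `.dat` files use `::` as the field delimiter. The sketch below shows how such a file is parsed, using a small in-memory sample (made up here from two rows of the movies data) instead of the real files. A multi-character separator is treated by pandas as a regular expression, which is why `engine='python'` is passed.

```python
import io

import pandas as pd

# Two sample rows in the same '::'-delimited layout as movies.dat
sample = (
    "18455::Sunrise: A Song of Two Humans (1927)::Drama|Romance\n"
    "18578::Wings (1927)::Drama|Romance|War|Action\n"
)

# A separator longer than one character is interpreted as a regex,
# which only the python parsing engine supports
sample_movies = pd.read_csv(io.StringIO(sample), sep='::', engine='python',
                            header=None,
                            names=['movie_id', 'movie_title', 'movie_genre'])
print(sample_movies)
```

Note that the single colon inside the first title is left alone, because only the two-character `::` sequence splits fields.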

In [86]:
import numpy as np
import pandas as pd

movies_dat = pd.read_csv('./data/movies.dat', sep='::', engine='python',
                         header=None, names=['movie_id', 'movie_title', 'movie_genre'])
users_dat = pd.read_csv('./data/users.dat', sep='::', engine='python',
                        header=None, names=['user_id', 'twitter_id'])
ratings_dat = pd.read_csv('./data/ratings.dat', sep='::', engine='python',
                          header=None, names=['user_id', 'movie_id', 'rating', 'time'])


In [None]:
# reminder of what ratings_dat looks like
ratings_dat.head()

**Question 1:** Of the below descriptions, which is the best description of the data structure associated with **ratings_dat**?

In [None]:
import data_solution_part2 as sp

a = "ratings_dat is a sparse representation of user-item ratings"
b = "ratings_dat is a dense representation of user-item ratings"
c = "ratings_dat is neither nor dense representation of user-item ratings"

your_answer = #a

sp.answer_one(your_answer)

**Question 2:** If you created a dataframe with movie_ids as the columns, user_ids as the rows, and each value filled with the rating of that user-movie combination, what type of representation would this be an example of?

In [None]:
a = "ratings_dat is a sparse representation of user-item ratings"
b = "ratings_dat is a dense representation of user-item ratings"
c = "ratings_dat is neither nor dense representation of user-item ratings"

your_answer = #a

sp.answer_two(your_answer)

**Task:**
Use [pandas pivot_table](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.pivot_table.html) to create the matrix described in question 2.  If you want to see how the resulting matrix should look, the `df_ratings_pivot` dataframe in the following cell holds the solution.
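
As a minimal sketch of what `pivot_table` does here, the example below pivots a tiny made-up ratings table (the `toy_ratings` values are hypothetical, not taken from MovieTweetings). Each `user_id` becomes a row, each `movie_id` becomes a column, and any user-movie pair without a rating shows up as `NaN`.

```python
import pandas as pd

# Hypothetical long-format ratings: one row per (user, movie) rating
toy_ratings = pd.DataFrame({
    'user_id':  [1, 1, 2, 3],
    'movie_id': [10, 20, 10, 30],
    'rating':   [8, 6, 9, 7],
})

# Rows are users, columns are movies, values are ratings
toy_pivot = pd.pivot_table(toy_ratings, values='rating',
                           index='user_id', columns='movie_id')
print(toy_pivot)
```

Four ratings spread over a 3 x 3 grid leave five empty cells, which is the same effect you will see at a much larger scale on the real data.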

In [9]:
#use the ratings_dat to answer the next question
ratings_pivot = #your answer here

In [None]:
df_ratings_pivot = sp.create_pivot_df(ratings_dat)
df_ratings_pivot.head()

**Question 3:** In the previous section, you saw that the dense representation of `ratings_dat` didn't have any missing values.  However, now we have pivoted this table.  How many missing values are in this pivot of the data, and how should we deal with them in this matrix?

In [42]:
# calculate the number of missing values
df = pd.pivot_table(ratings_dat, values='rating', index='user_id', columns='movie_id')
df.head()

user_id \ movie_id,18455,18578,19729,20585,20629,20927,21746,21749,22807,22883,...,11416594,11433098,11469660,11493232,11542214,11561866,11566164,11644096,11644170,11744784
1,,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,,,,3.0,,,5.0,4.0,...,,,,,,,,,,


In [23]:
df.isnull().sum().sum(), df.isnull().sum().iloc[0]

((df.shape[0]* df.shape[1]) - df.isnull().sum().sum())/(df.shape[0]* df.shape[1])

0.001395385576740341

In [None]:
a = (3286, 'fill missing with column average')
b = (3286, 'fill missing with row average')
c = (3286, "don't fill missing values")
d = (6384294, 'fill missing with column average')
e = (6384294, 'fill missing with row average')
f = (6384294, "don't fill missing values")
g = ("none of the above")

your_answer = #a

sp.answer_three(your_answer)

**Question 4:** Now that you have found the number of missing values in `df_ratings_pivot` (or your matching `ratings_pivot`), what percentage of the values aren't missing? (Round your answer to 4 digits following the decimal - Ex: 0.0001)
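
One way to compute these quantities is sketched below on a small hypothetical matrix (`toy_pivot` is made up for illustration). The same three steps should carry over to `df_ratings_pivot`.

```python
import numpy as np
import pandas as pd

# Hypothetical 3 x 3 user-by-movie matrix with 4 ratings and 5 gaps
toy_pivot = pd.DataFrame(
    [[8.0, 6.0, np.nan],
     [9.0, np.nan, np.nan],
     [np.nan, np.nan, 7.0]],
    index=pd.Index([1, 2, 3], name='user_id'),
    columns=pd.Index([10, 20, 30], name='movie_id'),
)

n_missing = int(toy_pivot.isnull().sum().sum())  # total NaN cells
n_cells = toy_pivot.size                         # rows * columns
frac_present = round(1 - n_missing / n_cells, 4)
print(n_missing, frac_present)
```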

In [None]:
#use this cell for your work

In [None]:
your_answer = #0.0001

sp.answer_four(your_answer)

**Task:** The data for many recommendation systems might not have user-item ratings.  Instead, you would only know if a user interacted with an item.  In these cases, you would have a similar dense matrix (just the ratings would be removed).  

Create a 0, 1 dataframe that would represent the sparse matrix for this situation.  Store the matrix in `binary_df` below.  You can check your result against `ans_binary_df`.  

In [45]:
# your work here

binary_df = # create your df here

8921

In [None]:
# check your df against the solution - run this cell to check
ans_binary_df = sp.create_binary_df(df)
# check that both the solution and your created df have the same stats
print("Your df has {} 1's and the solution has {} 1's".format(binary_df.sum().sum(), ans_binary_df.sum().sum()))
print("Your df has a shape of {} and the solution has shape: {}".format(binary_df.shape, ans_binary_df.shape))

print("Here is a header of what your df should look like:")
ans_binary_df.head()

**Extra:**  You have now created a binary user-item matrix.  One possible way to build it is shown in the cell below.

In [47]:
# 1 where a rating exists, 0 where it is missing
# (avoids DataFrame.applymap, which is deprecated in recent pandas versions)
binary_df = df.notnull().astype(int)
binary_df.head()

user_id \ movie_id,18455,18578,19729,20585,20629,20927,21746,21749,22807,22883,...,11416594,11433098,11469660,11493232,11542214,11561866,11566164,11644096,11644170,11744784
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,1,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0


**Question 5:** If your `user_id`s and `movie_id`s are both in order from smallest to largest, then you can run the cell below to see if your binary matrix matches the solution.  If `True` appears after running the below, they match!

In [None]:
sp.check_binary_df(binary_df, ans_binary_df)

**Extra:** 

The above shows the different manipulations of the data you would have to perform in most cases of creating a recommendation system, as most systems use Collaborative Filtering.

However, if you decide to create a recommendation system that incorporates specific user or item data, then it is important to know how to create features from categorical and text data.  

In the remaining sections of this workbook, you will be given an opportunity to test out these skills.

**Task:** Unfortunately, we aren't given any user information in this dataset.  However, we can create additional features associated with movie information.  Using `movies_dat`, create a dataframe like `new_movies_dat` below, where:
- the year of the movie is in its own column
- a separate column with either a 1 or 0 is provided for each genre-movie combo
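
The sketch below shows one possible approach on two sample rows. It is not necessarily how `sp.clean_movies_dat` works internally, and it assumes the year always appears in parentheses at the end of `movie_title` and that genres are separated by `|`.

```python
import pandas as pd

# Two sample rows in the same layout as movies_dat
toy_movies = pd.DataFrame({
    'movie_id': [18455, 18578],
    'movie_title': ['Sunrise: A Song of Two Humans (1927)', 'Wings (1927)'],
    'movie_genre': ['Drama|Romance', 'Drama|Romance|War|Action'],
})

# Pull the 4-digit year out of the trailing parentheses (kept as a string)
toy_movies['year'] = toy_movies['movie_title'].str.extract(
    r'\((\d{4})\)\s*$', expand=False)

# Strip the "(year)" suffix to leave just the title
toy_movies['title'] = toy_movies['movie_title'].str.replace(
    r'\s*\(\d{4}\)\s*$', '', regex=True)

# One 0/1 column per genre, split on '|'
genre_dummies = toy_movies['movie_genre'].str.get_dummies(sep='|')
toy_movies = pd.concat([toy_movies, genre_dummies], axis=1)
print(toy_movies)
```

`str.get_dummies` produces integer columns, so the result has `0`/`1` rather than the `0.0`/`1.0` floats you may see in the solution output.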


In [52]:
# look at your dataframe here
movies_dat.head()

,movie_id,movie_title,movie_genre
0,18455,Sunrise: A Song of Two Humans (1927),Drama|Romance
1,18578,Wings (1927),Drama|Romance|War|Action
2,19729,The Broadway Melody (1929),Drama|Musical|Romance
3,20585,Why Be Good? (1929),Comedy|Drama|Romance
4,20629,All Quiet on the Western Front (1930),Drama|War


In [None]:
new_movies_dat = sp.clean_movies_dat(movies_dat)
new_movies_dat.head()

In [89]:
# your work here

In [88]:
movies_dat.head()

,movie_id,movie_title,movie_genre,year,title,Thriller,Animation,Adventure,Sci-Fi,Short,...,Crime,Action,Drama,Western,Fantasy,Mystery,Documentary,Horror,War,Family
0,18455,Sunrise: A Song of Two Humans (1927),Drama|Romance,1927,Sunrise: A Song of Two Humans,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,18578,Wings (1927),Drama|Romance|War|Action,1927,Wings,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,19729,The Broadway Melody (1929),Drama|Musical|Romance,1929,The Broadway Melody,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,20585,Why Be Good? (1929),Comedy|Drama|Romance,1929,Why Be Good?,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,20629,All Quiet on the Western Front (1930),Drama|War,1930,All Quiet on the Western Front,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


In [None]:
# your work here

In [None]:
# your work here

In [None]:
# your work here

**Task:** In order to get some practice with text cleaning, we will need text related to movies.  One of the major pieces of text we could collect is the script for each movie.

One of the most popular websites for holding the scripts of movies is https://www.imsdb.com/.  I modified a codebase from here https://github.com/AnnaVM/Project_Plotline/ in order to scrape ~1200 movie scripts.  

By `pip` installing `selenium`, you can run `scraping_script.py` to do the same.  You may also need to `pip` install `beautifulsoup4` (the package that provides `BeautifulSoup`).

If you scrape the scripts, you will see the Chrome browser move to each page as it is scraped, and at the end you should have a path that looks like:

<img src="../../images/script_paths.png" width="200" height="400">

**Task:** Once all the scripts are scraped, create `script_dict` using the function below.

In [1]:
# running this cell will give you script_dict, which
# has a key for each file name and a value holding the
# text of that movie script
import os

def create_script_dict(script_path='./data/scraping/texts/'):
    '''
    script_path is the path to the .txt files
    
    script_dict is a dictionary that holds the .txt file name as the key
    and the contents of the file as the value
    '''
    script_dict = dict()
    movie_scripts = os.listdir(script_path)

    for script in movie_scripts:
        # the context manager closes each file even if reading fails
        with open(os.path.join(script_path, script), mode='r') as contents:
            script_dict[script] = contents.read()
    
    return script_dict

script_dict = create_script_dict()
script_dict[list(script_dict.keys())[0]]



**Task:** Perform common tasks in feature engineering text data:

* remove punctuation, parentheses, brackets, etc.
* make all words lower case
* remove stop words
* lemmatization of words
* create a tf-idf matrix

This process is well outlined in the article here: https://machinelearningmastery.com/clean-text-machine-learning-python/.

You will want to `pip` install `nltk` to complete this task, which may take some time to complete.

```
sudo pip install -U nltk
```

If you aren't able to perform some tasks, you may need to download the data associated with them using `.download()`.

```
import nltk
nltk.download()
```

In [2]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer 

def create_clean_script_results(script_dict):
    '''
    script_dict has keys as the .txt file name
    and values as the text associated with the document
    '''
    clean_scripts = dict()

    # build the stop word set and lemmatizer once, outside the loop
    nltk_words = set(stopwords.words('english'))
    lem = WordNetLemmatizer()

    for k, v in script_dict.items():
        # get tokens
        tokens = word_tokenize(v)

        # create lower case words - remove punctuation
        tokens = [word.lower() for word in tokens if word.isalpha()]

        # remove stop words and lemmatize
        tokens = [lem.lemmatize(word) for word in tokens if word not in nltk_words]

        # store the key and the list of cleaned tokens in clean_scripts
        clean_scripts[k] = tokens
        
    return clean_scripts
    
    
clean_scripts = create_clean_script_results(script_dict)
clean_scripts[list(clean_scripts.keys())[0]]

['jacket',
 'written',
 'massy',
 'tadjedin',
 'based',
 'screenplay',
 'marc',
 'rocco',
 'april',
 'pure',
 'white',
 'screen',
 'idyllic',
 'stillness',
 'looking',
 'feeling',
 'like',
 'heaven',
 'supposed',
 'second',
 'calm',
 'water',
 'seems',
 'mist',
 'screen',
 'slight',
 'shift',
 'left',
 'right',
 'suggest',
 'man',
 'suddenly',
 'white',
 'screen',
 'tugged',
 'see',
 'sheet',
 'covering',
 'presumably',
 'dead',
 'man',
 'william',
 'starks',
 'year',
 'old',
 'first',
 'time',
 'died',
 'int',
 'hospital',
 'kuwait',
 'day',
 'one',
 'tug',
 'sheet',
 'see',
 'suddenly',
 'hear',
 'william',
 'starks',
 'chaos',
 'hospital',
 'around',
 'doctor',
 'nurse',
 'tend',
 'best',
 'injured',
 'soldier',
 'glimpse',
 'starks',
 'reveals',
 'red',
 'stretcher',
 'soaked',
 'blood',
 'severe',
 'head',
 'wound',
 'bullet',
 'minced',
 'skull',
 'slowly',
 'steadily',
 'heartbeat',
 'heard',
 'muffled',
 'sound',
 'hospital',
 'pulse',
 'quickens',
 'pace',
 'world',
 'around',

**Task:** The final part would then be to create a tf-idf representation of the values.  This part continued to crash when I used sklearn, so I suggest using a Spark implementation.  An example of this implementation is shown here:
```
from pyspark import SparkContext
from pyspark.mllib.feature import HashingTF
from pyspark.mllib.feature import IDF

sc = SparkContext(...)

# Load documents (one per line).
documents = sc.textFile("...").map(lambda line: line.split(" "))

hashingTF = HashingTF()
tf = hashingTF.transform(documents)

tf.cache()
idf = IDF().fit(tf)
tfidf = idf.transform(tf)
```

The resource for this was found here: https://stackoverflow.com/questions/35857837/tfidf-memoryerror-how-to-avoid-this-issue.
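
If Spark is not available, a rough sklearn analogue of the `HashingTF` + `IDF` pipeline can be sketched with `HashingVectorizer` followed by `TfidfTransformer`. This assumes the crash was memory-related. The `toy_clean_scripts` dictionary and the `n_features` value are made up for illustration, and this has not been run against the full set of scraped scripts.

```python
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer

# Hypothetical stand-in for clean_scripts: file name -> list of cleaned tokens
toy_clean_scripts = {
    'a.txt': ['white', 'screen', 'hospital', 'doctor'],
    'b.txt': ['hospital', 'nurse', 'soldier', 'hospital'],
    'c.txt': ['screen', 'mist', 'water'],
}

# The default analyzer expects strings, so join each token list with spaces
docs = [' '.join(tokens) for tokens in toy_clean_scripts.values()]

# Hash tokens into a fixed number of columns (no vocabulary is stored);
# norm=None and alternate_sign=False keep plain non-negative counts
hasher = HashingVectorizer(n_features=2**10, norm=None, alternate_sign=False)
counts = hasher.transform(docs)

# Re-weight the hashed counts by inverse document frequency
tfidf = TfidfTransformer(smooth_idf=True, use_idf=True).fit_transform(counts)
print(tfidf.shape)
```

The trade-off of hashing is that the columns can no longer be mapped back to words, and different words can occasionally collide in the same column.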

In [None]:
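from sklearn.feature_extraction.text import TfidfVectorizer

# TfidfTransformer expects a matrix of counts, while TfidfVectorizer works
# on raw text, so join each script's tokens back into a single string
docs = [' '.join(tokens) for tokens in clean_scripts.values()]
tfidf_vectorizer = TfidfVectorizer(smooth_idf=True, use_idf=True)
X = tfidf_vectorizer.fit_transform(docs)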