# Project 6: IMDB

This project involves NLP, decision trees, bagging, boosting, and more!

---

## Load packages

You are likely going to need to install the `imdbpie` package:

    > pip install imdbpie

---

In [32]:
import os
import subprocess
import collections
import re
import csv
import json

import pandas as pd
import numpy as np
import scipy

from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor

import psycopg2
import requests
from imdbpie import Imdb
import nltk

import urllib
from bs4 import BeautifulSoup

import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

---

## Part 1: Acquire the Data

You will connect to the IMDB API to query for movies. 

See here for documentation on how to use the package:

https://github.com/richardasaurus/imdb-pie

#### 1. Connect to the IMDB API

In [33]:
from imdbpie import Imdb
imdb = Imdb()
imdb = Imdb(anonymize=True) # to proxy requests

# Create an instance with caching enabled.
# Note that cached responses expire every 2 hours or so
# (the API response itself dictates the expiry time).
# Each assignment replaces the previous one, so this is the instance used below.
imdb = Imdb(cache=True)

#### 2. Query the top 250 rated movies in the database

In [34]:
imdb.top_250()

[{u'can_rate': True,
  u'image': {u'height': 1388,
   u'url': u'http://ia.media-imdb.com/images/M/MV5BODU4MjU4NjIwNl5BMl5BanBnXkFtZTgwMDU2MjEyMDE@._V1_.jpg',
   u'width': 933},
  u'num_votes': 1660260,
  u'rating': 9.3,
  u'tconst': u'tt0111161',
  u'title': u'The Shawshank Redemption',
  u'type': u'feature',
  u'year': u'1994'},
 {u'can_rate': True,
  u'image': {u'height': 500,
   u'url': u'http://ia.media-imdb.com/images/M/MV5BMjEyMjcyNDI4MF5BMl5BanBnXkFtZTcwMDA5Mzg3OA@@._V1_.jpg',
   u'width': 333},
  u'num_votes': 1136368,
  u'rating': 9.2,
  u'tconst': u'tt0068646',
  u'title': u'The Godfather',
  u'type': u'feature',
  u'year': u'1972'},
 {u'can_rate': True,
  u'image': {u'height': 500,
   u'url': u'http://ia.media-imdb.com/images/M/MV5BNDc2NTM3MzU1Nl5BMl5BanBnXkFtZTcwMTA5Mzg3OA@@._V1_.jpg',
   u'width': 333},
  u'num_votes': 776109,
  u'rating': 9,
  u'tconst': u'tt0071562',
  u'title': u'The Godfather: Part II',
  u'type': u'feature',
  u'year': u'1974'},
 {u'can_rate': True,
 

#### 3. Make a dataframe from the movie data

Keep the fields:

    num_votes
    rating
    tconst
    title
    year
    
And discard the rest

In [35]:
a= imdb.top_250() #This is a list of 250
len(a) #250 movies

df = pd.DataFrame(a) #movies df
df.head(4)

Unnamed: 0,can_rate,image,num_votes,rating,tconst,title,type,year
0,True,{u'url': u'http://ia.media-imdb.com/images/M/M...,1660260,9.3,tt0111161,The Shawshank Redemption,feature,1994
1,True,{u'url': u'http://ia.media-imdb.com/images/M/M...,1136368,9.2,tt0068646,The Godfather,feature,1972
2,True,{u'url': u'http://ia.media-imdb.com/images/M/M...,776109,9.0,tt0071562,The Godfather: Part II,feature,1974
3,True,{u'url': u'http://ia.media-imdb.com/images/M/M...,1645935,9.0,tt0468569,The Dark Knight,feature,2008


In [36]:
cols = ['num_votes','rating','tconst','title','year']
df2 =df[cols]
df2.head(4)

Unnamed: 0,num_votes,rating,tconst,title,year
0,1660260,9.3,tt0111161,The Shawshank Redemption,1994
1,1136368,9.2,tt0068646,The Godfather,1972
2,776109,9.0,tt0071562,The Godfather: Part II,1974
3,1645935,9.0,tt0468569,The Dark Knight,2008


#### 3. Select only the top 100 movies

In [37]:
df2 = df2[0:100].copy() # first 100 rows ([0:101] would keep 101); the rows above are already sorted by rating
print df2
df2.shape
print df2.head(2)

     num_votes  rating     tconst  \
0      1660260     9.3  tt0111161   
1      1136368     9.2  tt0068646   
2       776109     9.0  tt0071562   
3      1645935     9.0  tt0468569   
4       849374     8.9  tt0108052   
5       437922     8.9  tt0050083   
6      1301183     8.9  tt0110912   
7      1194767     8.9  tt0167260   
8       494916     8.9  tt0060196   
9      1321861     8.9  tt0137523   
10     1218200     8.8  tt0120737   
11      821418     8.8  tt0080684   
12     1225084     8.8  tt0109830   
13     1439479     8.8  tt1375666   
14     1080893     8.7  tt0167261   
15      668685     8.7  tt0073486   
16      715698     8.7  tt0099685   
17     1196367     8.7  tt0133093   
18      225396     8.7  tt0047478   
19      895015     8.7  tt0076759   
20      524682     8.7  tt0317248   
21     1004235     8.6  tt0114369   
22      869467     8.6  tt0102926   
23      271085     8.6  tt0038650   
24      727988     8.6  tt0114814   
25      410934     8.6  tt0118799   
2

#### 4. Get the genres and runtime for each movie and add them to the dataframe

There can be multiple genres per movie, so this will need some finessing.
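One way to handle the multiple genres is to turn each genre into its own 0/1 indicator column. The sketch below uses plain Python and a hypothetical three-movie sample shaped like the genre lists returned by the API; the `genre_` column prefix is an illustrative choice, not something the assignment prescribes.

```python
# Hypothetical sample: genre lists keyed by movie id
genres_by_movie = {
    "tt0111161": ["Crime", "Drama"],
    "tt0468569": ["Action", "Crime", "Thriller"],
    "tt0060196": ["Western"],
}

# Every distinct genre becomes one indicator column
all_genres = sorted({g for genres in genres_by_movie.values() for g in genres})

# One row per movie, with 1 where the movie has the genre and 0 otherwise
rows = []
for tconst, genres in genres_by_movie.items():
    row = {"tconst": tconst}
    for g in all_genres:
        row["genre_" + g] = int(g in genres)
    rows.append(row)

for row in rows:
    print(row)
```

The resulting list of dictionaries can be passed to `pd.DataFrame(rows)` and merged back onto the movie dataframe by `tconst`.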

In [38]:
b = df2.tconst #Top 100 movies id's
print b[0:2]

0    tt0111161
1    tt0068646
Name: tconst, dtype: object


In [39]:
def get_title(tt):
    title = imdb.get_title_by_id(tt)
    return title.genres

# df2['genre'] = df2['tconst'].map(get_title)
# Likely slow because every get_title call makes its own request to IMDB.

genres4 = []
#Why didn't DataFrame list work?

for x in b[:]:
    c = get_title(x)
    genres4.append(c)
    print x, " : ", c
    
print genres4

tt0111161  :  [u'Crime', u'Drama']
tt0068646  :  [u'Crime', u'Drama']
tt0071562  :  [u'Crime', u'Drama']
tt0468569  :  [u'Action', u'Crime', u'Thriller']
tt0108052  :  [u'Biography', u'Drama', u'History']
tt0050083  :  [u'Crime', u'Drama']
tt0110912  :  [u'Crime', u'Drama']
tt0167260  :  [u'Adventure', u'Drama', u'Fantasy']
tt0060196  :  [u'Western']
tt0137523  :  [u'Drama']
tt0120737  :  [u'Adventure', u'Drama', u'Fantasy']
tt0080684  :  [u'Action', u'Adventure', u'Fantasy', u'Sci-Fi']
tt0109830  :  [u'Drama', u'Romance']
tt1375666  :  [u'Action', u'Mystery', u'Sci-Fi', u'Thriller']
tt0167261  :  [u'Action', u'Adventure', u'Drama', u'Fantasy']
tt0073486  :  [u'Drama']
tt0099685  :  [u'Biography', u'Crime', u'Drama']
tt0133093  :  [u'Action', u'Sci-Fi']
tt0047478  :  [u'Action', u'Adventure', u'Drama']
tt0076759  :  [u'Action', u'Adventure', u'Fantasy', u'Sci-Fi']
tt0317248  :  [u'Crime', u'Drama']
tt0114369  :  [u'Crime', u'Drama', u'Mystery', u'Thriller']
tt0102926  :  [u'Crime', u'D

In [40]:
def get_runtime(tt):
    title = imdb.get_title_by_id(tt)
    return title.runtime # runtime is reported in seconds (8520 s = 142 min)

runtime = []

b = df2.tconst

for x in b[:]:
    c = get_runtime(x)
    runtime.append(c)
    print x, " : ", c
    
print runtime

tt0111161  :  8520
tt0068646  :  10500
tt0071562  :  12120
tt0468569  :  9120
tt0108052  :  11700
tt0050083  :  5760
tt0110912  :  9240
tt0167260  :  12060
tt0060196  :  9660
tt0137523  :  8340
tt0120737  :  10680
tt0080684  :  7440
tt0109830  :  8520
tt1375666  :  8880
tt0167261  :  10740
tt0073486  :  7980
tt0099685  :  8760
tt0133093  :  8160
tt0047478  :  9480
tt0076759  :  7260
tt0317248  :  7800
tt0114369  :  7620
tt0102926  :  7080
tt0038650  :  7800
tt0114814  :  6360
tt0118799  :  6960
tt0110413  :  6600
tt0064116  :  8700
tt0245429  :  7500
tt0120815  :  10140
tt0120586  :  6060
tt0034583  :  6120
tt0816692  :  10140
tt0054215  :  6540
tt0021749  :  5220
tt0082971  :  6900
tt1675434  :  6720
tt0047396  :  6720
tt0027977  :  5220
tt0120689  :  11340
tt0103064  :  9180
tt0253474  :  9000
tt0407887  :  9060
tt0088763  :  6960
tt2582802  :  6420
tt0172495  :  9300
tt0209144  :  6780
tt0078788  :  9180
tt0482571  :  7800
tt0110357  :  5340
tt0057012  :  5700
tt0043014  :  6600
tt0

#### 4. Write the Results to a csv

In [41]:
df2['runtime'] = runtime

df2['genres'] = genres4

df3 = df2

print df2.head(2)

#import io
#df3 = io.open("./out_1_Z.csv", 'w', encoding='utf8')

df3.to_csv('./out_1_Z.csv', encoding='utf-8')

   num_votes  rating     tconst                     title  year  runtime  \
0    1660260     9.3  tt0111161  The Shawshank Redemption  1994     8520   
1    1136368     9.2  tt0068646             The Godfather  1972    10500   

           genres  
0  [Crime, Drama]  
1  [Crime, Drama]  


In [45]:
df3.to_csv('./out_1_ZB.csv', encoding='utf-8')

---

## Part 2: Wrangle the text data

#### 1. Scrape the reviews for the top 100 movies

*Hint*: Use a loop to scrape the pages one at a time

In [42]:
# title = imdb.get_title_reviews(df2.tconst[0])
# a = title
# a = pd.Series(a)
#reviews = imdb.get_title_reviews("tt0468569")

reviews= []

b = df2.tconst

for x in b[:]:
    c = imdb.get_title_reviews(x)
    reviews.append(c)
#    print x, " : ", c
    
print reviews[0]

[<Review: u'Why do I want to wri'>, <Review: u'\nCan Hollywood, usua'>, <Review: u'\nI have never seen s'>, <Review: u'In its Oscar year, S'>, <Review: u'The reason I became '>, <Review: u'\nI believe that this'>, <Review: u'\nOne of my all time '>, <Review: u'\nOne of the finest f'>, <Review: u'Misery and Stand By '>, <Review: u'\nThe Shawshank Redem'>]


#### 2. Extract the reviews and the rating per review for each movie

*Note*: "soup" from BeautifulSoup is the html returned from all 25 pages. You'll need to either address each page individually or break them down by elements.
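Addressing each page individually comes down to generating one URL per page. The sketch below assumes the `reviews?start=` URL pattern used elsewhere in this notebook and 10 reviews per page; `review_page_urls` is a hypothetical helper name, and the review count of 25 is made up for illustration.

```python
def review_page_urls(movie_tt, total_reviews, per_page=10):
    """Build one URL per page of reviews, stepping the start offset by per_page."""
    base = "http://www.imdb.com/title/" + movie_tt + "/reviews?start="
    return [base + str(start) for start in range(0, total_reviews, per_page)]

# 25 reviews at 10 per page need three requests (offsets 0, 10 and 20)
urls = review_page_urls("tt0111161", 25)
for u in urls:
    print(u)
```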

In [43]:
import requests

from scrapy.selector import Selector

movie_tt = b[0]

# Find the total number of reviews listed for a movie
def extract_pages(movie_tt):

    urla = "http://www.imdb.com/title/" + movie_tt + "/reviews?start=" + str(10)

    response = requests.get(urla)
    HTML = response.text

    # This xpath query returns the table cell text, one cell of which
    # holds the total number of reviews
    tot_reviews = "//table/tr/td/text()"
    page_tot = Selector(text=HTML).xpath(tot_reviews).extract()

    # Keep the last cell that mentions "reviews"
    page_tot_b = None
    for z in page_tot:
        if "reviews" in z:
            page_tot_b = z

    # No matching cell: report zero reviews instead of failing on .split()
    if page_tot_b is None:
        return 0

    # The first token of the matching cell is the review count
    return int(page_tot_b.split()[0])

In [46]:

def extract_reviews_txt(movie_tt, pages):

    review_txt = []

    review_df=pd.DataFrame()

    for i in range(0,pages,10):
        page_no = i
        urla = "http://www.imdb.com/title/" + movie_tt + "/reviews?start=" + str(page_no)
        response = requests.get(urla)
        HTML = response.text  
        print urla

        for j in range(1,11):

            # the j-th <p> under #tn15content is read as the j-th review on the page
            selector_a = "//div[@id='tn15content']/p[" + str(j) + "]/text()"

            results_a = Selector(text=HTML).xpath(selector_a)
            total_results_a = results_a.extract()

            review_txt.append(total_results_a)


    review_df['txt'] = review_txt
    return review_df

pages =20
print extract_reviews_txt(movie_tt, pages)  

http://www.imdb.com/title/tt0111161/reviews?start=0
http://www.imdb.com/title/tt0111161/reviews?start=10
                                                  txt
0   [\nWhy do I want to write the 234th comment on...
1   [\n\nCan Hollywood, usually creating things fo...
2   [\n\nI have never seen such an amazing film si...
3   [\nIn its Oscar year, Shawshank Redemption (wr...
4   [\nThe reason I became a member of this databa...
5   [\n\nI believe that this film is the best stor...
6   [\n\nOne of my all time favorites. Shawshank R...
7   [\n\nOne of the finest films made in recent ye...
8   [\nMisery and Stand By Me were the best adapta...
9   [\n\nThe Shawshank Redemption is without a dou...
10  [\nThis movie is not your ordinary Hollywood f...
11  [\n\nWhenever I talk about this movie with my ...
12  [\nI'm trying to save you money; this is the l...
13  [\nThe Shawshank Redemption is written and dir...
14  [\nAt the heart of this extraordinary movie is...
15  [\n\n**Yes, there are SPOIL

In [47]:
def extract_reviews(movie_tt, pages):

    review_score = []
    review_author = []
    review_title = []
    review_df=pd.DataFrame()

    for i in range(0,pages,10): #increments of 10
        page_no = i
        urla = "http://www.imdb.com/title/" + movie_tt + "/reviews?start=" + str(page_no)
        response = requests.get(urla)
        HTML = response.text  
        print urla

        for j in range(1,11):
            
            jb = j*2-1
            
            selector_b = "//div[" + str(jb) + "]/img/@src"
            selector_c = "//div[@id='tn15content']/div[" + str(jb) + "]/a[2]/text()"


            results_b = Selector(text=HTML).xpath(selector_b)
            
            try:
                total_results_b = results_b[0].extract()
            except IndexError:
                # no rating image found, so this review carries no score
                total_results_b = ""
                
            results_c = Selector(text=HTML).xpath(selector_c)
            total_results_c = results_c.extract()

            review_title.append(movie_tt)
            review_score.append(total_results_b)
            review_author.append(total_results_c)
        
    review_df['tt'] = review_title
    review_df['score']= review_score
    review_df['author'] = review_author
    return review_df
    
print extract_reviews(movie_tt, pages)        
# print review_txt[2]
# print review_score[2]
# print review_author[2]

http://www.imdb.com/title/tt0111161/reviews?start=0
http://www.imdb.com/title/tt0111161/reviews?start=10
           tt                                             score  \
0   tt0111161  http://i.media-imdb.com/images/showtimes/100.gif   
1   tt0111161  http://i.media-imdb.com/images/showtimes/100.gif   
2   tt0111161                                                     
3   tt0111161  http://i.media-imdb.com/images/showtimes/100.gif   
4   tt0111161                                                     
5   tt0111161   http://i.media-imdb.com/images/showtimes/80.gif   
6   tt0111161  http://i.media-imdb.com/images/showtimes/100.gif   
7   tt0111161  http://i.media-imdb.com/images/showtimes/100.gif   
8   tt0111161  http://i.media-imdb.com/images/showtimes/100.gif   
9   tt0111161  http://i.media-imdb.com/images/showtimes/100.gif   
10  tt0111161  http://i.media-imdb.com/images/showtimes/100.gif   
11  tt0111161  http://i.media-imdb.com/images/showtimes/100.gif   
12  tt0111161  http://i.
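The score column above holds image URLs rather than numbers. A minimal sketch for turning them into ratings follows; it assumes that the number in the file name is the rating multiplied by 10 (so `100.gif` means 10/10), which is inferred from the output above rather than documented. `score_from_image_url` is a hypothetical helper name.

```python
import re

def score_from_image_url(url):
    """Return the score as an int out of 10, or None when there is no rating image."""
    match = re.search(r"/showtimes/(\d+)\.gif$", url)
    if match is None:
        return None
    # Assumption: the file name encodes the rating times 10
    return int(match.group(1)) // 10

print(score_from_image_url("http://i.media-imdb.com/images/showtimes/100.gif"))
print(score_from_image_url("http://i.media-imdb.com/images/showtimes/80.gif"))
print(score_from_image_url(""))
```

Reviews without a rating image come back as `None`, so they can be dropped or imputed explicitly later.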

#### 3. Remove the non AlphaNumeric characters from reviews

In [50]:
import re

def remove_nonalphanum(your_string):
    return re.sub(r'\W+', ' ', your_string)

# for i, j in enumerate(review_txt):
#     a = str(review_txt[i])
#     a = a.replace("\\n", " ")
#     a = remove_nonalphanum(a)
#     review_txt[i] = a
    
def remove_nonalphanum_txt(review_txt):
    a = str(review_txt)
    a = a.replace("\\n", " ")
    return re.sub(r'\W+', ' ', a)
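To see what this cleaning step does, here is a small self-contained sketch applied to a list shaped like the scraped output. `clean_review` is a hypothetical name that mirrors `remove_nonalphanum_txt` and additionally strips the leading and trailing spaces.

```python
import re

def clean_review(raw):
    """Replace literal backslash-n sequences and runs of non-word characters with one space."""
    text = str(raw).replace("\\n", " ")
    return re.sub(r"\W+", " ", text).strip()

# str() of a list keeps the newline as the two characters "\" and "n",
# which is why the literal sequence is replaced before the regex runs
sample = ["\nWhy do I want to write the 234th comment?"]
print(clean_review(sample))
```

Note that `\W` treats the underscore as a word character, so underscores survive this step.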


#### 4. Calculate the top 200 ngrams from the user reviews

Use the `TfidfVectorizer` in sklearn.

Recommended parameters:

    ngram_range = (1, 2)
    stop_words = 'english'
    binary = False
    max_features = 200
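As a rough illustration of these parameters, the sketch below fits one `TfidfVectorizer` on a whole corpus and ranks the n-grams by their summed weight. The three review strings are made up, and ranking by summed tf-idf is one reasonable reading of "top n-grams", not the only one.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus standing in for the scraped review texts
reviews = [
    "A moving prison drama about hope and friendship",
    "The prison scenes are slow but the ending delivers hope",
    "Great acting and a great story about friendship",
]

tvec = TfidfVectorizer(
    stop_words='english',
    ngram_range=(1, 2),
    binary=False,
    max_features=200,
)
# Fit on the whole corpus so the idf weights reflect all reviews
tfidf = tvec.fit_transform(reviews)

# Newer scikit-learn releases expose get_feature_names_out();
# older ones only have get_feature_names()
if hasattr(tvec, "get_feature_names_out"):
    names = list(tvec.get_feature_names_out())
else:
    names = list(tvec.get_feature_names())

# Rank n-grams by their summed tf-idf weight across reviews
totals = np.asarray(tfidf.sum(axis=0)).ravel()
ranked = sorted(zip(names, totals), key=lambda pair: pair[1], reverse=True)

for ngram, weight in ranked[:5]:
    print(ngram, round(float(weight), 3))
```

Fitting once on all reviews gives every row the same 200 columns, which is what a downstream model needs.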

In [51]:
from sklearn.feature_extraction.text import TfidfVectorizer
#Week 6, lab 4.1 intro to nlp

def vectorize_me(review_txt):

    string1 = str(review_txt)

    tvec = TfidfVectorizer(
        stop_words='english',
        ngram_range = (1, 2),
        binary = False,
        max_features = 200
                          )
    tvec.fit([string1])

    df  = pd.DataFrame(tvec.transform([string1]).todense(),
                       columns=tvec.get_feature_names(),
                       index=['string1'])

    # Fitting on a single review makes the idf term constant, so these are
    # effectively the review's most frequent n-grams (at most 200)
    return df.columns.values

In [53]:

def extract_reviews_ngram(movie_tt, pages):

    review_txt = []
    review_score = []
    review_author = []
    review_title = []
    review_df=pd.DataFrame()

    for i in range(0,pages,10):
        page_no = i
        urla = "http://www.imdb.com/title/" + movie_tt + "/reviews?start=" + str(page_no)
        response = requests.get(urla)
        HTML = response.text  
        #print urla

        for j in range(1,11):
            
            selector_a = "//div[@id='tn15content']/p[" + str(j) + "]/text()"


            results_a = Selector(text=HTML).xpath(selector_a)
            total_results_a = results_a.extract()
            
            total_results_a = remove_nonalphanum_txt(total_results_a)
            
            try:
                total_results_a = vectorize_me(total_results_a)
            except Exception:
                # e.g. nothing left to vectorize once stop words are removed
                total_results_a = ""
                
            review_txt.append(total_results_a)

            jb = j*2-1
            
            selector_b = "//div[" + str(jb) + "]/img/@src"
            selector_c = "//div[@id='tn15content']/div[" + str(jb) + "]/a[2]/text()"


            results_b = Selector(text=HTML).xpath(selector_b)
            
            try:
                total_results_b = results_b[0].extract()
            except IndexError:
                # no rating image found, so this review carries no score
                total_results_b = ""
                
            results_c = Selector(text=HTML).xpath(selector_c)
            total_results_c = results_c.extract()

            review_title.append(movie_tt)
            review_score.append(total_results_b)
            review_author.append(total_results_c)        

    review_df['title'] = review_title
    review_df['txt'] = review_txt
    review_df['score'] = review_score
    review_df['author'] = review_author
    
    return review_df
    

#### 5. Merge the user reviews and ratings

In [None]:
review_df2 = pd.DataFrame(columns=['title','txt','score','author'])
review_df3 = pd.DataFrame()

for i, tt in enumerate(b):
    maxreviews = extract_pages(tt)
    #this is taking too long, limit max reviews to 1000
    maxreviews = min(maxreviews, 1000)
    print("Review %d: %d" % (i, maxreviews))

    try:
        review_df3 = extract_reviews_ngram(tt, maxreviews)
        review_df2 = review_df2.append(review_df3, ignore_index=True)
    except Exception as e:
        # report the failure instead of silently skipping the movie
        print("Skipping %s: %s" % (tt, e))

In [None]:
print review_df2.head(50)


#### 6. Save this merged dataframe as a csv

In [None]:
review_df2.to_csv('./out_2.csv', encoding='utf-8')

---

## Part 3: Combine Tables in PostgreSQL

#### 1. Import your two .csv data files into your Postgre Database as two different tables

In [None]:
#week 5 lab 3.3 local postgres lab - postico
#Connect to local postgres server


For ease, we can call these table1 and table2

In [182]:
import pandas as pd
df_table1 = pd.read_csv('./out_1.csv')
df_table2 = pd.read_csv('./out_2.csv')

#df.columns = [c.lower() for c in df.columns] #postgres doesn't like capitals or spaces
#psql -h dsi.c20gkj5cvu3l.us-east-1.rds.amazonaws.com -p 5432 -U dsi_student titanic
#engine = create_engine('postgresql://dsi_student:gastudents@dsi.c20gkj5cvu3l.us-east-1.rds.amazonaws.com:5432/titanic')



from sqlalchemy import create_engine
#engine = create_engine('postgresql://username:password@localhost:5432/dbname')
engine = create_engine('postgresql://noriogura:@localhost:5432/project6')

df_table1.to_sql("table1", engine)
df_table2.to_sql("table2", engine)



In [186]:
#pd.read_sql("SELECT * FROM my_table_name", engine)
            
pd.read_sql("SELECT * FROM pg_catalog.pg_tables WHERE schemaname='public'", con=engine)

Unnamed: 0,schemaname,tablename,tableowner,tablespace,hasindexes,hasrules,hastriggers,rowsecurity
0,public,untitled_table,noriogura,,True,False,False,False
1,public,my_table_name,noriogura,,True,False,False,False


In [188]:
df_table1 = pd.read_sql("SELECT * FROM table1", con=engine)
df_table1.head(10)

Unnamed: 0.1,index,Unnamed: 0,num_votes,rating,tconst,title,year,runtime,genres
0,0,0,1659464,9.3,tt0111161,The Shawshank Redemption,1994,8520,"[u'Crime', u'Drama']"
1,1,1,1135853,9.2,tt0068646,The Godfather,1972,10500,"[u'Crime', u'Drama']"
2,2,2,775753,9.0,tt0071562,The Godfather: Part II,1974,12120,"[u'Crime', u'Drama']"
3,3,3,1645164,9.0,tt0468569,The Dark Knight,2008,9120,"[u'Action', u'Crime', u'Thriller']"
4,4,4,848954,8.9,tt0108052,Schindler's List,1993,11700,"[u'Biography', u'Drama', u'History']"
5,5,5,437710,8.9,tt0050083,12 Angry Men,1957,5760,"[u'Crime', u'Drama']"
6,6,6,1300563,8.9,tt0110912,Pulp Fiction,1994,9240,"[u'Crime', u'Drama']"
7,7,7,1194223,8.9,tt0167260,The Lord of the Rings: The Return of the King,2003,12060,"[u'Adventure', u'Drama', u'Fantasy']"
8,8,8,494702,8.9,tt0060196,"The Good, the Bad and the Ugly",1966,9660,[u'Western']
9,9,9,1321233,8.9,tt0137523,Fight Club,1999,8340,[u'Drama']
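After the CSV round trip the genres column holds strings such as `"[u'Crime', u'Drama']"` rather than lists. A minimal sketch for parsing one such value back into a list, using only the standard library:

```python
import ast

# Value as it appears in the genres column after the CSV round trip
stored = "[u'Crime', u'Drama']"

# literal_eval parses Python literals only, so it is safer than eval()
genres = ast.literal_eval(stored)
print(genres)
print(type(genres).__name__)
```

On a dataframe the same call can be applied column-wide, for example `df_table1['genres'].apply(ast.literal_eval)`.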


In [None]:
df_table2 = pd.read_sql("SELECT * FROM table2", con=engine)
df_table2.head(10)

#### 2. Connect to database and query the joined set

In [None]:
#week6-2.1-sql-joins

sql = """
-- table2 stores the movie id in its "title" column (see extract_reviews_ngram)
SELECT * FROM table2 rev_table
LEFT JOIN table1 genre_table
ON rev_table."title" = genre_table."tconst"
"""

df = pd.read_sql(sql, con=engine)
df.head(3)
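To show what the LEFT JOIN returns without needing a running PostgreSQL server, the sketch below swaps in an in-memory SQLite database from the standard library. The two tables are hypothetical miniatures of table1 and table2, and the reviews table is given a `tconst` column for simplicity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical miniature versions of table1 (movies) and table2 (reviews)
cur.execute("CREATE TABLE table1 (tconst TEXT, title TEXT, rating REAL)")
cur.execute("CREATE TABLE table2 (tconst TEXT, score INTEGER)")
cur.executemany("INSERT INTO table1 VALUES (?, ?, ?)", [
    ("tt0111161", "The Shawshank Redemption", 9.3),
    ("tt0068646", "The Godfather", 9.2),
])
cur.executemany("INSERT INTO table2 VALUES (?, ?)", [
    ("tt0111161", 10),
    ("tt0111161", 8),
    ("tt9999999", 7),  # review whose movie is missing from table1
])

# LEFT JOIN keeps every review row; movie columns are NULL when nothing matches
rows = cur.execute("""
    SELECT rev_table.tconst, rev_table.score, genre_table.title
    FROM table2 rev_table
    LEFT JOIN table1 genre_table
    ON rev_table.tconst = genre_table.tconst
    ORDER BY rev_table.tconst, rev_table.score
""").fetchall()

for row in rows:
    print(row)
conn.close()
```

Naming the selected columns explicitly, rather than using `SELECT *`, also avoids duplicate column names in the joined result.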

#### 3. Join the two tables 

In [None]:
df.to_sql("table_12", engine)

#### 4. Select the newly joined table and save two copies of it into dataframes

In [None]:
df_table12A = pd.read_sql("SELECT * FROM table_12", con=engine) #copy1
df_table12B = df_table12A.copy() #copy2: .copy() makes an independent dataframe rather than a second name for copy1

---

## Part 4: Parsing and Exploratory Data Analysis

#### 1. Rename any columns you think should be renamed for clarity

#### 2. Describe anything interesting or suspicious about your data (quality assurance)

#### 3. Make four visualizations of interest to you using the data

---

## Part 5: Decision Tree Classifiers and Regressors

#### 1. What is our target attribute? 

Choose a target variable for the decision tree regressor and the classifier. 

#### 2. Prepare the X and Y matrices and preprocess data as you see fit

#### 3. Build and cross-validate your decision tree classifier

#### 4. Gridsearch optimal parameters for your classifier. Does the performance improve?

#### 5. Build and cross-validate your decision tree regressor

#### 6. Gridsearch the optimal parameters for your regressor. Does performance improve?

---

## Part 6: Elastic Net


#### 1. Gridsearch optimal parameters for an ElasticNet using the regression target and predictors you used for the decision tree regressor.


#### 2. Is cross-validated performance better or worse than with the decision trees? 

#### 3. Explain why the elastic net may have performed best at that particular l1_ratio and alpha

---

## Part 7: Bagging and Boosting: Random Forests, Extra Trees, and AdaBoost

#### 1. Load the random forest regressor, extra trees regressor, and adaboost regressor from sklearn

#### 2. Gridsearch optimal parameters for the three different ensemble methods.

#### 3. Evaluate the performance of the two bagging models and the one boosting model. Which performs best?

#### 4. Extract the feature importances from the Random Forest regressor and make a DataFrame pairing variable names with their variable importances.

#### 5. Plot the ranked feature importances.

#### 6.1 [BONUS] Gridsearch an optimal Lasso model and use it for variable selection (make a new predictor matrix with only the variables not zeroed out by the Lasso). 

#### 6.2 [BONUS] Gridsearch your best performing bagging/boosting model from above with the features retained after the Lasso. Does the score improve?

#### 7.1. [BONUS] Select a threshold for variable importance from your Random Forest regressor and use that to perform feature selection, creating a new subset predictor matrix.

#### 7.2 [BONUS] Using BaggingRegressor with a base estimator of your choice, test a model using the feature-selected dataset you made in 7.1

---

## [VERY BONUS] Part 8: PCA

#### 1. Perform a PCA on your predictor matrix

#### 2. Examine the variance explained and determine what components you want to keep based on them.

#### 3. Plot the cumulative variance explained by the ordered principal components.

#### 4. Gridsearch an elastic net using the principal components you selected as your predictors. Does this perform better than the elastic net you fit earlier?

#### 5. Gridsearch a bagging ensemble estimator that you fit before, this time using the principal components as predictors. Does this perform better or worse than the original? 

#### 6. Look at the loadings of the original predictor columns on the first 3 principal components. Is there any kind of intuitive meaning here?

Hint: you will probably want to sort the loadings by absolute value, and look only at the obviously important (larger) ones!
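The sorting suggested in the hint can be sketched in plain Python. The loadings and the 0.3 cutoff below are made-up values for illustration; real loadings would come from the fitted PCA.

```python
# Hypothetical loadings of four predictors on one principal component
loadings = {"runtime": 0.62, "num_votes": -0.71, "year": 0.05, "genre_Drama": -0.33}

# Sort by absolute value so strong negative loadings rank alongside strong positive ones
ranked = sorted(loadings.items(), key=lambda item: abs(item[1]), reverse=True)

# Keep only the larger loadings (the 0.3 cutoff is an arbitrary illustration)
important = [(name, value) for name, value in ranked if abs(value) >= 0.3]

for name, value in important:
    print(name, value)
```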

## [Extremely Bonus] Part 9: Clustering

![](https://snag.gy/jPSZ6U.jpg)

***Bonus Bonus:***
This extended bonus question asks you to attempt something we never really covered, based on the assumptions we learned during this week's clustering lesson(s).

#### 1. Import your favorite clustering module

#### 2. Encode categoricals

#### 3. Evaluate cluster metrics solely based on a range of K
If K-Means:  SSE/Inertia vs Silhouette (ie: Elbow), silhouette average, etc

#### 4.  Look at your data based on the subset of your predicted clusters.
Assign the cluster predictions back to your dataframe in order to see them in context. Grouping by cluster then gives a sense of which data clumped together.

#### 5. Describe your findings based on the predicted clusters 
_How well did it do?  What's good or bad?  How would you improve this? Does any of it make sense?_