# HW2 - Develop a Categorization System

This problem set is meant to help you familiarize yourself with Python and Pandas. 

### Before You Start
For this problem set, you should download INF0202-HW2.ipynb from bCourses. Create a local copy of the notebook and rename it LASTNAME_FIRSTNAME-HW2.ipynb. Then edit your renamed file directly in your browser by typing:
```
jupyter notebook <name_of_downloaded_file>
```

Make sure the following libraries load correctly (hit Shift-Enter).

In [45]:
#IPython is what you are using now to run the notebook
import IPython
print("IPython version:      %6.6s (need at least 1.0)" % IPython.__version__)

# Pandas makes working with data tables easier
import pandas as pd
print("Pandas version:       %6.6s (need at least 0.11.0)" % pd.__version__)

import numpy as np
# Module for plotting
import matplotlib as plt
%matplotlib inline
print("Matplotlib version:   %6.6s (need at least 1.2.1)" % plt.__version__)

# A tool we'll use to aid our data exploration
import itertools

IPython version:       7.8.0 (need at least 1.0)
Pandas version:       0.25.3 (need at least 0.11.0)
Matplotlib version:    3.1.1 (need at least 1.2.1)


### Working in a group?
List the names of other students with whom you worked on this problem set:

### Introduction to the assignment

For this assignment and upcoming assignments, you will be using an IMDB Movie Dataset (download from bCourses). 

Use the following commands to load the dataset:

In [3]:
#load dataset
imdb = pd.read_csv("IMDB_movies.csv", low_memory=False, encoding = "ISO-8859-1")

#print(imdb)

#subset to only the first 100 movies (copy so that new columns can be added without a SettingWithCopyWarning)
imdb = imdb[:100].copy()

     Rank                    Title                     Genre  \
0       1  Guardians of the Galaxy   Action,Adventure,Sci-Fi   
1       2               Prometheus  Adventure,Mystery,Sci-Fi   
2       3                    Split           Horror,Thriller   
3       4                     Sing   Animation,Comedy,Family   
4       5            Suicide Squad  Action,Adventure,Fantasy   
..    ...                      ...                       ...   
995   996     Secret in Their Eyes       Crime,Drama,Mystery   
996   997          Hostel: Part II                    Horror   
997   998   Step Up 2: The Streets       Drama,Music,Romance   
998   999             Search Party          Adventure,Comedy   
999  1000               Nine Lives     Comedy,Family,Fantasy   

                 Director                                             Actors  \
0              James Gunn  Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...   
1            Ridley Scott  Noomi Rapace, Logan Marshall-Green, Michael 

### Understanding the data

Let's take a look at the dataset with some lightweight exploratory data analysis.

In [3]:
imdb.head()

Unnamed: 0,Rank,Title,Genre,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore
0,1,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76.0
1,2,Prometheus,"Adventure,Mystery,Sci-Fi",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0
2,3,Split,"Horror,Thriller",M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,157606,138.12,62.0
3,4,Sing,"Animation,Comedy,Family",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,60545,270.32,59.0
4,5,Suicide Squad,"Action,Adventure,Fantasy",David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,123,6.2,393727,325.02,40.0


Are there any nulls we need to watch out for?

In [4]:
imdb.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 11 columns):
Rank                  100 non-null int64
Title                 100 non-null object
Genre                 100 non-null object
Director              100 non-null object
Actors                100 non-null object
Year                  100 non-null int64
Runtime (Minutes)     100 non-null int64
Rating                100 non-null float64
Votes                 100 non-null int64
Revenue (Millions)    91 non-null float64
Metascore             94 non-null float64
dtypes: float64(3), int64(4), object(4)
memory usage: 8.7+ KB
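The `info()` summary above shows that `Revenue (Millions)` and `Metascore` have fewer than 100 non-null entries. One way to count the missing values per column directly is `isnull().sum()`. The sketch below uses a small made-up frame (the rows are hypothetical, not taken from the dataset) so that it runs on its own:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the imdb dataframe
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Revenue (Millions)": [10.5, np.nan, 3.2, np.nan],
    "Metascore": [70.0, 55.0, np.nan, 81.0],
})

# Count the missing values in each column
null_counts = toy.isnull().sum()
print(null_counts)

# Keep only the names of columns that actually contain nulls
cols_with_nulls = null_counts[null_counts > 0].index.tolist()
print(cols_with_nulls)
```

With the real data, `toy` would simply be replaced by `imdb`.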


Since most films in the dataset belong to multiple genres, let's build a set of the distinct individual genres to see how many genres we are working with.

In [113]:
unique_genres = imdb['Genre'].unique()
individual_genres = []

# split each comma-separated genre combination into a list of genres
for genre in unique_genres:
    individual_genres.append(genre.split(','))

# flatten the list of lists into a single list of genres
individual_genres = list(itertools.chain.from_iterable(individual_genres))

# converting to a set removes duplicates
individual_genres = set(individual_genres)

individual_genres

{'Action',
 'Adventure',
 'Animation',
 'Biography',
 'Comedy',
 'Crime',
 'Drama',
 'Family',
 'Fantasy',
 'History',
 'Horror',
 'Music',
 'Mystery',
 'Romance',
 'Sci-Fi',
 'Thriller',
 'War',
 'Western'}
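
The same result can also be reached with a single set comprehension. A minimal sketch, using a short hypothetical list of genre strings in place of `imdb['Genre']`:

```python
# Hypothetical genre strings in the same comma-separated format as the Genre column
genre_strings = [
    "Action,Adventure,Sci-Fi",
    "Adventure,Mystery,Sci-Fi",
    "Horror,Thriller",
]

# Split each string on commas and collect the pieces into a set,
# which keeps only one copy of each genre
individual_genres = {g for s in genre_strings for g in s.split(",")}

print(sorted(individual_genres))
print(len(individual_genres))
```

With the real data, `genre_strings` would be `imdb['Genre']`.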

### Brainstorm a Categorization System

Categories provide the framework for organizing resources. Classification assigns individual resources to categories. When humans classify, we have rationales for how we assign resources to categories. These criteria are in part how we carve up the categories themselves. The "principles" for defining categories (enumeration, properties, similarity, cultural vs individual vs institutional) are embodied in the classifications that use these principles.

#### Your task is to create 3 new categories (columns) for this dataset. Before beginning, please outline responses to the questions below. 
1. What is the purpose of these categories? How might each of these categories be used in an information retrieval task?
2. What "principles" will you be using to define categories? Briefly explain why you've chosen these principles to define your categories. 
3. Are the categories at a consistent level of abstraction and granularity? Briefly explain your choice of abstraction and granularity for each category.
4. What are the data types of your categories? Ordinal? Categorical? Continuous? Other? Briefly explain your choice of data type for each category. 

Categories:

1. Filming Location
2. Bacon Meter
3. Badass Movies

In [None]:
# question 1
'''
Filming Location: Lets people see where the film was shot (e.g. L.A., Georgia, N.Y.). Matched with cost of production,
this could help directors identify locations that are more cost-effective.

Bacon Meter: Allows users to sort movies by highest or lowest Bacon score
(how many degrees an actor is from Kevin Bacon). This would be useful for users looking to find movies that are
not tainted by Bacon's malevolent Hollywood web.

Badass Movies: Catalogues Action, Adventure, Sci-Fi, or Fantasy movies that are highly rated (7.3 or above*).
This would be good for users looking for a fun, action-packed night.

*The threshold accounts for IMDB ratings normally being harsher on action movies.
'''

In [64]:
# question 2
'''
These categories use classical principles: each is defined by rigid properties
(e.g. every movie has been filmed in a primary location, every actor is a certain distance from
Kevin Bacon, every movie has a genre and a rating). This means it makes sense to define each
category by the properties its members share.
'''


In [65]:
# question 3
'''
For my Filming Location category, I am showing the city in which the movie was filmed.
This is a granular category for location. While we could indicate the country in
which a film was made, this might be less useful for smaller movie production companies, which
might not have a choice of international locations. Additionally, in the US, there are often
states (such as Georgia) and cities (such as L.A.) that provide specific tax credits for
movies that are made there. It therefore makes sense to provide city-level information for people
who are looking to retrieve it.

My Bacon Meter is more of an abstraction. Given that one's Kevin Bacon number must be
an integer (one cannot have a 1.5 Bacon number), there is no way to be more or less granular
in choosing a Bacon number.

My third category, Badass Movies, is abstract because it combines different categories (genre and rating)
to define a new one. While one could be more granular in the rating criteria or in the breadth of genres selected,
this would generate a different, distinct category. There is no hard property of "badass" either;
it is my own personal distinction.
'''



In [None]:
# question 4
'''
Filming Location: Nominal; while films are occasionally shot in multiple locations, they usually
have a primary location where they are made. These primary locations are distinct from each other
and have no inherent order.

Bacon Meter: Ordinal; one's Bacon number is a ranking of how far away one is from Kevin Bacon. Because
of this, the numbers are ordered (1 is closer than 2, 2 is closer than 3).

Badass Movies: Binary; movies aren't 78% badass, they're either badass or they're not.

'''
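
The three data types described above map onto pandas dtypes roughly as follows. This is only a sketch on made-up values: the column names mirror the categories, but the data and the choice of dtypes are assumptions rather than part of the assignment.

```python
import pandas as pd

# Hypothetical values for the three categories
toy = pd.DataFrame({
    "Filming Location": ["Atlanta, GA", "London, UK", "Atlanta, GA"],
    "Bacon Number": [2, 1, 3],
    "Badass": [True, False, True],
})

# Nominal: unordered categories
toy["Filming Location"] = toy["Filming Location"].astype("category")

# Ordinal: ordered categories, so min/max and comparisons are meaningful
toy["Bacon Number"] = pd.Categorical(
    toy["Bacon Number"], categories=[1, 2, 3], ordered=True
)

# Binary: the Badass column is already a plain boolean column
print(toy.dtypes)
print(toy["Bacon Number"].min())
```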

### Develop a Categorization System

Using the data contained in the dataframe `imdb` created above, create three new categories and append them as new columns to the dataframe `imdb`. Because there are only 100 rows, you can either assign categories by hand or use a function to do so.
_Hint: if using a function, it may be useful to use the function pandas.DataFrame.apply._
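
As a rough illustration of the hint, `DataFrame.apply` with `axis=1` calls a function once per row and collects the results into a Series. The frame and the `is_long` function below are hypothetical:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Title": ["Short Film", "Epic Film"],
    "Runtime (Minutes)": [85, 170],
})

def is_long(row):
    # row is a Series holding one movie's values, indexed by column name
    return row["Runtime (Minutes)"] > 120

# axis=1 passes each row (rather than each column) to the function
toy["Long Movie"] = toy.apply(is_long, axis=1)
print(toy)
```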

In [118]:
# your code here - category 1
import random

def rand_city():
    # filming locations are assigned at random from this list
    city_list = ['San Francisco, CA', 'London, UK', 'Tokyo, Japan', 'Barcelona, Spain', 'Chicago, IL', 'Liverpool, UK', 'New York, NY', 'Toronto, ON', 'Atlanta, GA', 'Los Angeles, CA', 'Rome, Italy', 'Philadelphia, PA', 'Boston, MA']
    return random.choice(city_list)

# draw one random city per row
imdb['Filming Location'] = [rand_city() for _ in range(len(imdb))]

imdb.head()


Unnamed: 0,Rank,Title,Genre,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore,Filming Location,Badass Movies,Lowest Bacon Number
0,1,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76.0,"Rome, Italy",True,4: Vin Diesel
1,2,Prometheus,"Adventure,Mystery,Sci-Fi",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0,"Rome, Italy",False,2: Noomi Rapace
2,3,Split,"Horror,Thriller",M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,157606,138.12,62.0,"Toronto, ON",False,9: Haley Lu Richardson
3,4,Sing,"Animation,Comedy,Family",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,60545,270.32,59.0,"Toronto, ON",False,12: Matthew McConaughey
4,5,Suicide Squad,"Action,Adventure,Fantasy",David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,123,6.2,393727,325.02,40.0,"Atlanta, GA",False,5: Jared Leto


In [115]:
# your code here - category 2
def badass(row):
    genres = row['Genre'].split(',')
    BA_List = ['Action', 'Adventure', 'Sci-Fi', 'Fantasy'] # acceptable badass genres
    # True if at least one of the movie's genres is in our badass list
    in_badass_genre = any(g in BA_List for g in genres)
    # badass only if the genre matches and the rating is high enough
    return bool(in_badass_genre and (row['Rating'] >= 7.3))


imdb['Badass Movies'] = imdb.apply(badass, axis=1)

imdb.head()

Unnamed: 0,Rank,Title,Genre,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore,Filming Location,Badass Movies
0,1,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76.0,"Philadelphia, PA",True
1,2,Prometheus,"Adventure,Mystery,Sci-Fi",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0,"Barcelona, Spain",False
2,3,Split,"Horror,Thriller",M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,157606,138.12,62.0,"Chicago, ILLiverpool, UK",False
3,4,Sing,"Animation,Comedy,Family",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,60545,270.32,59.0,"Chicago, ILLiverpool, UK",False
4,5,Suicide Squad,"Action,Adventure,Fantasy",David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,123,6.2,393727,325.02,40.0,"Boston, MA",False


In [116]:
# your code here - category 3
import random

def rand_number(row):
    # split the actor list and strip the whitespace left after each comma
    actors = [name.strip() for name in row['Actors'].split(',')]
    # we don't actually have a Bacon number dataset, so we pair a random
    # integer from 1 to 12 with a randomly chosen actor from the Actors column
    return str(random.randint(1, 12)) + ': ' + random.choice(actors)


imdb['Lowest Bacon Number'] = imdb.apply(rand_number, axis=1)

imdb.head()

Unnamed: 0,Rank,Title,Genre,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore,Filming Location,Badass Movies,Lowest Bacon Number
0,1,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76.0,"Philadelphia, PA",True,4: Vin Diesel
1,2,Prometheus,"Adventure,Mystery,Sci-Fi",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0,"Barcelona, Spain",False,2: Noomi Rapace
2,3,Split,"Horror,Thriller",M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,157606,138.12,62.0,"Chicago, ILLiverpool, UK",False,9: Haley Lu Richardson
3,4,Sing,"Animation,Comedy,Family",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,60545,270.32,59.0,"Chicago, ILLiverpool, UK",False,12: Matthew McConaughey
4,5,Suicide Squad,"Action,Adventure,Fantasy",David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,123,6.2,393727,325.02,40.0,"Boston, MA",False,5: Jared Leto


### Display Final Categorization System

Calling `imdb.head()` should display the first five rows of the dataset, including the three newly created category columns.

In [117]:
imdb.head()

Unnamed: 0,Rank,Title,Genre,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore,Filming Location,Badass Movies,Lowest Bacon Number
0,1,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76.0,"Philadelphia, PA",True,4: Vin Diesel
1,2,Prometheus,"Adventure,Mystery,Sci-Fi",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0,"Barcelona, Spain",False,2: Noomi Rapace
2,3,Split,"Horror,Thriller",M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,157606,138.12,62.0,"Chicago, ILLiverpool, UK",False,9: Haley Lu Richardson
3,4,Sing,"Animation,Comedy,Family",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,60545,270.32,59.0,"Chicago, ILLiverpool, UK",False,12: Matthew McConaughey
4,5,Suicide Squad,"Action,Adventure,Fantasy",David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,123,6.2,393727,325.02,40.0,"Boston, MA",False,5: Jared Leto
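
Once the new columns exist, they can support simple retrieval tasks such as filtering and sorting. A sketch on made-up rows (the column names follow the ones created above; the values are hypothetical):

```python
import pandas as pd

# Hypothetical rows with two of the new category columns
toy = pd.DataFrame({
    "Title": ["Movie A", "Movie B", "Movie C"],
    "Rating": [8.1, 7.0, 7.6],
    "Filming Location": ["Atlanta, GA", "London, UK", "Atlanta, GA"],
    "Badass Movies": [True, False, True],
})

# Retrieve the badass movies, best-rated first
badass = toy[toy["Badass Movies"]].sort_values("Rating", ascending=False)
print(badass["Title"].tolist())

# Count how many movies were filmed in each location
location_counts = toy["Filming Location"].value_counts()
print(location_counts)
```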
