##### movies-q3 notebook
***

<h1>Question 1 - What genres have been the most lucrative?</h1>

    1. What genres appear most frequently?
    2. Which earned the most in 2018? ...2016-18?
    3. How predictable is ROI by genre?
    4. What mix of genres do top studios use?

### importing required libraries

In [57]:
import os # for setting the current directory

import numpy as np
import pandas as pd

import sqlite3
import pandasql

import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns

In [58]:
# setting the current working directory
os.chdir("c:/users/jd/flatiron/project01/dsc-mod-1-project-v2-1-online-ds-ft-120919/")

# printing the current working directory
print(os.getcwd())

c:\users\jd\flatiron\project01\dsc-mod-1-project-v2-1-online-ds-ft-120919


### Connecting to sqlite database

In [59]:
# connecting to sql movies_db data source and instantiate a cursor
conn = sqlite3.connect("movies_db.sqlite")
cur = conn.cursor()

### listing the sqlite table names

In [60]:
# getting names of all tables in the sql database
sql_tables = conn.execute("select name from sqlite_master where type='table';")

# assigning table names to a variable as a list for future iteration
table_list = list(map(lambda x: x[0], sql_tables.fetchall()))
table_list

['clean_bom_tbl', 'clean_tn_tbl', 'clean_imdb_title_tbl', 'studio_titles_tbl']

***

<h2 align='center'><font color='chocolate'>Genres</font></h2>

We can find titles with their genres in the `clean_imdb_title_tbl`.

### Querying sqlite3 to view `clean_imdb_title_tbl`

In [61]:
# connecting to the required table as a temporary table
# using temporary table to inspect data
cur.execute('''SELECT *
                    FROM clean_imdb_title_tbl
                    WHERE start_year BETWEEN 2016 AND 2018
                    ;''')

data_df = pd.DataFrame(cur.fetchall())
data_df.columns = [x[0] for x in cur.description]

print(data.shape)
data_df.head(3)

(49462, 33)


Unnamed: 0,index,tconst,primary_title,original_title,start_year,runtime_minutes,genres,documentary,adult,biography,...,action,history,fantasy,sport,adventure,musical,comedy,game-show,drama,news
0,2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama,0,0,0,...,0,0,0,0,0,0,0,0,1,0
1,3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama",0,0,0,...,0,0,0,0,0,0,1,0,1,0
2,4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.0,"Comedy,Drama,Fantasy",0,0,0,...,0,0,1,0,0,0,1,0,1,0


### Dropping columns and rows

In [62]:
# deleting all rows where column `genres` has value 'None' 
print("Starting count of empty 'genres':", data_df.genres.isnull().sum())
data_df = data_df[data_df['genres'].notnull()]
print("Ending count of empty 'genres':", data_df.genres.isnull().sum())

Starting count of empty 'genres': 2163
Ending count of empty 'genres': 0


In [63]:
# removing unnecessary columns
data_df.drop(columns = ['index', 'original_title', 'runtime_minutes'])

# setting the index
data_df.set_index("tconst", inplace=True)

In [64]:
# copying to our separate, working dataframe
clean_imdb_title_df = data_df.copy(deep=True)

print(clean_imdb_title_df.shape)

# verifying removal of rows with no genres
clean_imdb_title_df[clean_imdb_title_df['genres'].isnull() == True]

(49462, 33)


Unnamed: 0_level_0,index,primary_title,original_title,start_year,runtime_minutes,genres,documentary,adult,biography,romance,...,action,history,fantasy,sport,adventure,musical,comedy,game-show,drama,news
tconst,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1


No rows without genres remain

<h2 align='center'><font color='coral'>Counting Genres</font></h2>

### Counting the occurrences of any individual genre

In [None]:
clean_imdb_title_df.genres.isnull().sum()

In [None]:
clean_imdb_title_df.query('genres.isnull() == True')

In [None]:
clean_imdb_title_df.drama.value_counts()

In [None]:
for cols in clean_imdb_title_df.columns():
    col_count

### Add `...` to sqlite

***

<h2 align='center'><font color='soral'>Recommendations</font></h2>

***

In [None]:
# looking at ...
# plt.figure(figsize=(8, 5))
# sns.violinplot(x='year', y='Total_gross', data=bom_2016_18_df)

In [None]:
import matplotlib.pyplot as plt
import matplotlib

matplotlib.style.use('ggplot')


data = [[2000, 2000, 2000, 2001, 2001, 2001, 2002, 2002, 2002],
        ['Jan', 'Feb', 'Mar', 'Jan', 'Feb', 'Mar', 'Jan', 'Feb', 'Mar'],
        [1, 2, 3, 4, 5, 6, 7, 8, 9]]

rows = zip(data[0], data[1], data[2])
headers = ['Year', 'Month', 'Value']
df = pd.DataFrame(rows, columns=headers)

df

In [None]:
pivot_df = df.pivot(index='Year', columns='Month', values='Value')
pivot_df

In [None]:
colors = ["#006D2C", "#31A354","#74C476"]
#Note: .loc[:,['Jan','Feb', 'Mar']] is used here to rearrange the layer ordering
pivot_df.loc[:,['Jan','Feb', 'Mar']].plot.barh(stacked=True, color=colors, figsize=(10,7))

***