In [1]:
import pandas as pd
import numpy as np

# Step 1: Read in the Data

Data is being read in from [learn-co-students](https://github.com/learn-co-students/dc_ds_06_03_19/tree/master/module_1/week_3_project/data)

Each dataframe is named after its file (with '.' replaced by underscores), followed by ```_df```. 
The cell after each read usually shows ```.head()```, which should give a simple big-picture view of what is in the file.
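
As a rough illustration of that naming convention, here is a minimal sketch. The helper ```dataframe_name``` is hypothetical (it is not part of this notebook) and it assumes the files end in ```.csv.gz``` or ```.csv```.

```python
from pathlib import Path

def dataframe_name(filename):
    """Derive a name such as 'tn_movie_budgets_df' from a data file name (hypothetical helper)."""
    stem = Path(filename).name
    # Drop the '.csv.gz' (or '.csv') suffix before swapping dots for underscores.
    for suffix in ('.csv.gz', '.csv'):
        if stem.endswith(suffix):
            stem = stem[:-len(suffix)]
            break
    return stem.replace('.', '_') + '_df'

print(dataframe_name('data/tn.movie_budgets.csv.gz'))  # tn_movie_budgets_df
```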

***


In [2]:
tn_movie_budgets_df = pd.read_csv('data/tn.movie_budgets.csv.gz')

FileNotFoundError: [Errno 2] No such file or directory: 'data/tn.movie_budgets.csv.gz'

In [None]:
tn_movie_budgets_df.head()

In [None]:
imdb_title_ratings_df = pd.read_csv('data/imdb.title.ratings.csv.gz')

In [None]:
imdb_title_ratings_df.head()

# Summary of Step 1

## First Round of Questions About Data

1. Can we find out how these movies were released (Netflix, Hulu, box office, Amazon Prime, YouTube)?
2. How are movie genres determined?
3. How dirty are any of these data sets?
4. Do we know that grosses are USD?
5. And do dollars account for inflation?
6. How much rounding is going on in these grosses?

## Ideas About What We Can Ask

1. Can movies be very popular if they are released closely together? Maybe viewer fatigue is something to watch out for...
2. What is the difference between movies which do well both in the US and abroad, and movies which only do well in one or the other?
3. What is the minimum movie budget that correlates with different levels of US/worldwide income?
4. **Of producers who are known for recently released movies with no domestic gross (streamed releases), who spends the most and has the highest ratings?**

# Step 2: Let's try to clean some of this data
***

## Clean with tn_movie_budgets_df

In [None]:
tn_movie_budgets_df.head()

In [None]:
tn_movie_budgets_df.info()

In [None]:
tn_movie_budgets_df.shape

***
### Are any rows duplicated?
***

In [None]:
tn_movie_budgets_df.duplicated().sum()

In [None]:
tn_movie_budgets_df.duplicated('movie').sum()

In [None]:
tn_movie_budgets_df['repeated_name'] = tn_movie_budgets_df.movie.duplicated(keep=False)

In [None]:
tn_movie_budgets_df[tn_movie_budgets_df['repeated_name'] == True].sort_values('movie')

There appears to be no actually duplicated data; all of the rows with repeated names appear to be remakes of an original. Let's change the name of that column from ```repeated_name``` to ```remade```.

In [None]:
tn_movie_budgets_df['repeated_name'] = tn_movie_budgets_df.movie.duplicated()

In [None]:
tn_movie_budgets_df.rename(columns={'repeated_name':'remade'}, inplace=True)

No, there aren't duplicated rows in a negative sense. 
Just remade movies. 
We're okay with that!
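
A small sketch of how the two ```duplicated``` calls differ, using made-up titles rather than rows from the data set. With ```keep=False``` every occurrence of a repeated name is flagged; with the default, only occurrences after the first one (in row order) are flagged, so a "remade" flag built that way depends on how the rows are sorted.

```python
import pandas as pd

# Made-up titles, for illustration only.
titles = pd.Series(['Halloween', 'Alien', 'Halloween'])

# keep=False flags every row whose title appears more than once.
print(titles.duplicated(keep=False).tolist())  # [True, False, True]

# The default (keep='first') flags only the later occurrences.
print(titles.duplicated().tolist())            # [False, False, True]
```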

***
### Are there weird values?
***

While checking for duplicates, we noticed that there were 0s in the ```domestic_gross``` and ```worldwide_gross``` columns.
Let's do the following:
- [ ] first, convert all money columns to ints from objects, 
- [ ] second, make a new column called ```international_gross``` which is ```worldwide_gross```$-$```domestic_gross```, and
- [ ] finally, decide whether or not to drop rows with too many \$0s.
***

In [None]:
tn_movie_budgets_df.dtypes

In [None]:
tn_movie_budgets_df = tn_movie_budgets_df.astype({'production_budget':'str', 'domestic_gross':'str', 'worldwide_gross':'str'})

The following 5 cells should be rolled into one function called convert_money_obj_to_int
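
One possible shape for that function is sketched below. This is an assumption about how the cells could be combined, not the version used in this notebook: it strips ```$``` and ```,``` anywhere in the string (rather than only a leading ```$```), and it is shown on made-up values.

```python
import pandas as pd

def convert_money_obj_to_int(df, columns):
    """Return a copy of df with '$1,234'-style string columns converted to int64."""
    result = df.copy()
    for column in columns:
        cleaned = (
            result[column]
            .astype(str)
            .str.replace('$', '', regex=False)   # drop the dollar sign
            .str.replace(',', '', regex=False)   # drop thousands separators
        )
        result[column] = cleaned.astype('int64')
    return result

# Made-up values, for illustration only.
toy_df = pd.DataFrame({'production_budget': ['$1,000,000', '$250,000'],
                       'domestic_gross': ['$0', '$3,500,000']})
toy_df = convert_money_obj_to_int(toy_df, ['production_budget', 'domestic_gross'])
print(toy_df.dtypes)
```

On the real data the call would presumably be ```convert_money_obj_to_int(tn_movie_budgets_df, ['production_budget', 'domestic_gross', 'worldwide_gross'])```.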

In [None]:
def get_rid_of_dollar_sign(amount):
    if amount.startswith('$'):
        amount = amount[1:]
    else:
        print('Crap, one of my values didn\'t start with a dollar sign')
    return amount

In [None]:
for title in ['production_budget', 'domestic_gross', 'worldwide_gross']:
    tn_movie_budgets_df[title] = tn_movie_budgets_df[title].map(get_rid_of_dollar_sign)

In [None]:
for title in ['production_budget', 'domestic_gross', 'worldwide_gross']:
    # Remove the thousands separators so the strings can be cast to integers.
    tn_movie_budgets_df[title] = tn_movie_budgets_df[title].map(lambda x: x.replace(',', ''))

In [None]:
tn_movie_budgets_df.head()

In [None]:
tn_movie_budgets_df = tn_movie_budgets_df.astype({'production_budget':'int64', 'domestic_gross':'int64', 'worldwide_gross':'int64'})

In [None]:
tn_movie_budgets_df.info()

***
- [x] first, convert all money columns to ints from objects, 
- [ ] second, make a new column called ```international_gross``` which is ```worldwide_gross```$-$```domestic_gross```, and
- [ ] finally, decide whether or not to drop rows with too many \$0s.
***

While we're at it, we might as well make all the columns the appropriate data types...

In [None]:
tn_movie_budgets_df = tn_movie_budgets_df.astype({'movie':'str'})

In [None]:
tn_movie_budgets_df.release_date = pd.to_datetime(tn_movie_budgets_df.release_date)

In [None]:
tn_movie_budgets_df.info()

...Okay, back to the business at hand.

In [None]:
tn_movie_budgets_df['international_gross'] = tn_movie_budgets_df['worldwide_gross'] - tn_movie_budgets_df['domestic_gross']

- [x] first, convert all money columns to ints from objects, 
- [x] second, make a new column called ```international_gross``` which is ```worldwide_gross```$-$```domestic_gross```, and
- [ ] finally, decide whether or not to drop rows with too many \$0s.

In [None]:
tn_movie_budgets_df.loc[tn_movie_budgets_df.international_gross == 0].shape

***
That's a lot of movies with no international gross!

Now I'm concerned about movies with no worldwide gross or no domestic gross. 
Let's see how many of those there are.
***

In [None]:
tn_movie_budgets_df.loc[(tn_movie_budgets_df.domestic_gross) == 0].shape

In [None]:
tn_movie_budgets_df.loc[(tn_movie_budgets_df.domestic_gross) == 0].head()

In [None]:
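# Rows where both grosses are zero (explicit comparisons instead of a bitwise OR on the integers).
tn_movie_budgets_df.loc[(tn_movie_budgets_df.worldwide_gross == 0) & (tn_movie_budgets_df.domestic_gross == 0)].shape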

In [None]:
tn_movie_budgets_df.loc[(tn_movie_budgets_df.worldwide_gross == 0) & (tn_movie_budgets_df.domestic_gross == 0)].head()

In [None]:
tn_movie_budgets_df.loc[(tn_movie_budgets_df.domestic_gross == 0) & (tn_movie_budgets_df.worldwide_gross > 0)].shape

In [None]:
tn_movie_budgets_df.loc[(tn_movie_budgets_df.domestic_gross == 0) & (tn_movie_budgets_df.worldwide_gross > 0)].head()

In [None]:
tn_movie_budgets_df.loc[(tn_movie_budgets_df.worldwide_gross) == 0].shape

In [None]:
tn_movie_budgets_df.loc[(tn_movie_budgets_df.worldwide_gross) == 0].head()

There are 548 movies with 0 dollars in ```domestic_gross```.
Of those movies, 181 have ```international_gross``` (maybe were only released internationally), and the other 367 have no ```domestic_gross``` or ```international_gross```. 
We think these 367 movies were released online only, which means they are of particular interest to our analysis!
***
So we are deciding to keep all of our data:
- [x] first, convert all money columns to ints from objects, 
- [x] second, make a new column called ```international_gross``` which is ```worldwide_gross```$-$```domestic_gross```, and
- [x] finally, decide whether or not to drop rows with too many \$0s.
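
The breakdown above can be reproduced with boolean masks. The sketch below uses made-up rows, so its counts are not the 548/181/367 from the real data; it only shows that the two subgroups partition the zero-domestic-gross rows.

```python
import pandas as pd

# Made-up rows, for illustration only; the real counts come from tn_movie_budgets_df.
toy_df = pd.DataFrame({
    'domestic_gross':  [0, 0, 500, 0],
    'worldwide_gross': [0, 800, 900, 0],
})
toy_df['international_gross'] = toy_df['worldwide_gross'] - toy_df['domestic_gross']

no_domestic = toy_df['domestic_gross'] == 0
international_only = no_domestic & (toy_df['international_gross'] > 0)
no_gross_at_all = no_domestic & (toy_df['international_gross'] == 0)

# The two subgroups add up to the zero-domestic-gross total.
print(no_domestic.sum(), international_only.sum(), no_gross_at_all.sum())  # 3 1 2
```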


We are also deciding to compare the set of data we think are online releases to the data we think are not. 
Let's create a new column to mark the difference, then continue cleaning the data by isolating the movies released between 2010 and 2018.

In [None]:
tn_movie_budgets_df['online_release'] = tn_movie_budgets_df.worldwide_gross == 0

### Get rid of movies not released between 2010 and 2018

In [None]:
tn_movie_budgets_df['release_year'] = tn_movie_budgets_df.release_date.dt.year

In [None]:
recent_tn_movie_budgets_df = tn_movie_budgets_df.loc[(2010<=tn_movie_budgets_df['release_year']) & (tn_movie_budgets_df['release_year']<=2018)]

In [None]:
recent_tn_movie_budgets_df.online_release.value_counts()

There are still 251 online releases and 1873 box office releases; enough to perform some analysis.

# Step 3: Visualization

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
# Pass x and y as keywords; recent seaborn releases no longer accept them positionally.
sns.lmplot(x='release_date', y='production_budget', data=recent_tn_movie_budgets_df.loc[recent_tn_movie_budgets_df['online_release']], fit_reg=False)

In [None]:
sns.lmplot(x='release_date', y='production_budget', data=recent_tn_movie_budgets_df.loc[~recent_tn_movie_budgets_df['online_release']], fit_reg=False)

In [None]:
sns.lmplot(x='production_budget', y='worldwide_gross', data=recent_tn_movie_budgets_df.loc[~recent_tn_movie_budgets_df['online_release']], fit_reg=False)

In [None]:
sns.lmplot(x='production_budget', y='domestic_gross', data=recent_tn_movie_budgets_df.loc[~recent_tn_movie_budgets_df['online_release']], fit_reg=False)

In [None]:
sns.lmplot(x='production_budget', y='international_gross', data=recent_tn_movie_budgets_df.loc[~recent_tn_movie_budgets_df['online_release']], fit_reg=False)

## Next Ideas

Can we guess how much the online releases made based on the data from the box office releases?
To do that, we could:
1. Find a (linear) regression between production budget and the different grosses
2. After merging data, find the relationship between ratings and the different grosses
3. If the relationships are similar, use ratings as a proxy for gross in the online release data
4. Estimate the return on production budget for online releases
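
A minimal sketch of step 1, assuming an ordinary least-squares line fitted with ```np.polyfit```. The numbers are synthetic (in millions of dollars) and lie exactly on a line so the fit is easy to check; on the real data the inputs would presumably be ```production_budget``` and one of the gross columns for the box office releases. The helper ```predict_gross``` is hypothetical.

```python
import numpy as np

# Synthetic values in millions of dollars, for illustration only.
budget = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
gross = 2.5 * budget + 1.0  # an exact line: slope 2.5, intercept 1.0

# Fit gross = slope * budget + intercept by least squares.
slope, intercept = np.polyfit(budget, gross, deg=1)

def predict_gross(production_budget):
    """Estimate gross from budget using the fitted line (hypothetical helper)."""
    return slope * production_budget + intercept

print(round(slope, 3), round(intercept, 3))
print(round(predict_gross(60.0), 3))
```

Real box office data will be far noisier than this, so the quality of the fit should be checked before using it to estimate anything about online releases.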

In [None]:
tn_movie_budgets_df.head()