## Correlation and Regression Part 1

*(Coding along with the Udemy course [Python for Business and Finance](https://www.udemy.com/course/complete-python-for-business-and-finance-bootcamp/) by Alexander Hagmann.)*

### Covariance & Correlation

- The project goal here is to calculate and interpret the covariance and the correlation coefficient between budget and revenue for movies that were released in 2016.

- We'll have to test whether the correlation coefficient is significantly different from zero at the 5% level of significance.

- Finally we're going to visualize the relationship between budget and revenue.
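As a rough sketch of the first two steps, the snippet below computes the sample covariance, the Pearson correlation coefficient and the t statistic for the null hypothesis "correlation equals zero" on a small made-up data set. The numbers and the name `toy` are assumptions for illustration only, not values from the movies data. The test statistic used is the standard t = r * sqrt(n - 2) / sqrt(1 - r²) with n - 2 degrees of freedom.

```python
import numpy as np
import pandas as pd

# toy data (hypothetical values in millions, not taken from the movies dataset)
toy = pd.DataFrame({
    "budget":  [10.0, 20.0, 30.0, 40.0, 50.0],
    "revenue": [25.0, 35.0, 80.0, 90.0, 150.0],
})

cov = toy.budget.cov(toy.revenue)    # sample covariance (divides by n - 1)
r = toy.budget.corr(toy.revenue)     # Pearson correlation coefficient

# t statistic for H0: correlation == 0, with n - 2 degrees of freedom
n = len(toy)
t_stat = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)

print(cov, r, t_stat)
```

To decide at the 5% level, |t| would be compared with the two-sided critical value of a t distribution with n - 2 degrees of freedom (from a table or, for example, `scipy.stats.t.ppf`).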

### OLS Regression and ANOVA

Some more tasks we're going to tackle in this project:

- Creating a simple Linear Regression Model between budget (independent variable) and revenue (dependent variable) for movies that were released in 2016. We're going to calculate & interpret the regression coefficients.

- Performing an Analysis of Variance (ANOVA) and calculating and interpreting the coefficient of determination.

- Performing hypothesis tests (two-sided) on intercept and slope (1% level of significance). This will answer the question of whether the feature budget is statistically significant.
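A minimal sketch of what the regression coefficients and the coefficient of determination are, using plain NumPy rather than a dedicated regression library. The data is made up and all variable names are assumptions; the hypothesis tests on intercept and slope are not covered here.

```python
import numpy as np

# toy data (hypothetical, in millions)
x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])    # budget, independent variable
y = np.array([25.0, 35.0, 80.0, 90.0, 150.0])   # revenue, dependent variable

# closed-form OLS estimates for y = intercept + slope * x
slope = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
intercept = y.mean() - slope * x.mean()

# ANOVA decomposition: total = explained + residual sum of squares
y_hat = intercept + slope * x
sst = np.sum((y - y.mean()) ** 2)       # total sum of squares
ssr = np.sum((y_hat - y.mean()) ** 2)   # explained (regression) sum of squares
sse = np.sum((y - y_hat) ** 2)          # residual (error) sum of squares
r_squared = ssr / sst                   # coefficient of determination

print(slope, intercept, r_squared)
```

In a simple regression with one feature, `r_squared` equals the squared correlation coefficient between `x` and `y`.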

### Multiple Regression

- Creating a Multiple Regression Model explaining the dependent variable revenue for movies that were released between 2010 and 2016.

- Creating/engineering features (e.g. dummy variables) and dropping non-significant features (model specification).

- Determining the model's __goodness of fit__.

- Performing and interpreting an F Test.
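To make the idea of dummy variables more concrete, here is a small sketch that turns a categorical column into dummies and fits a multiple regression by least squares. The `genre` column, the values and the choice of `np.linalg.lstsq` are assumptions for illustration; the actual features are engineered later in the project.

```python
import numpy as np
import pandas as pd

# toy data; the "genre" column and all values are hypothetical
toy = pd.DataFrame({
    "budget":  [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
    "genre":   ["Drama", "Action", "Drama", "Action", "Comedy", "Comedy"],
    "revenue": [20.0, 70.0, 55.0, 130.0, 120.0, 150.0],
})

# one dummy column per genre; drop_first avoids perfect collinearity with the intercept
dummies = pd.get_dummies(toy.genre, drop_first=True, dtype=float)
X = pd.concat([toy[["budget"]], dummies], axis=1)
X.insert(0, "const", 1.0)   # intercept column

# least-squares fit of revenue on const, budget and the genre dummies
coef, *_ = np.linalg.lstsq(X.to_numpy(), toy.revenue.to_numpy(), rcond=None)
print(dict(zip(X.columns, coef)))
```

The dropped category ("Action" here) becomes the baseline: each dummy coefficient measures the difference to that baseline at a given budget.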

### Application in Finance: Fama French Factor Models

Create and interpret the following regression models for Microsoft (MSFT) using daily returns between 2016 and 2018:

- Single Factor Model / CAPM

- Fama French Three Factor Model

- Fama French Five Factor Model

Which factors significantly explain Microsoft's returns (1% level of significance)?

Calculate alpha and test whether it is statistically significant.
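A hedged sketch of the single factor model: the stock's excess return is regressed on the market's excess return, the intercept is alpha and the slope is beta. The returns below are synthetic random numbers (not MSFT or real market data), and the "true" alpha and beta are assumptions chosen only to have something to recover.

```python
import numpy as np

# synthetic daily excess returns; purely illustrative, not real MSFT or market data
rng = np.random.default_rng(0)
market_excess = rng.normal(0.0005, 0.01, size=500)
noise = rng.normal(0.0, 0.005, size=500)
stock_excess = 0.0002 + 1.2 * market_excess + noise   # assumed alpha 0.0002, beta 1.2

# single factor model: stock_excess = alpha + beta * market_excess + error
X = np.column_stack([np.ones_like(market_excess), market_excess])
(alpha, beta), *_ = np.linalg.lstsq(X, stock_excess, rcond=None)
print(alpha, beta)
```

The three and five factor models follow the same pattern with additional columns in `X` (size and value factors, then profitability and investment factors).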

### Issues in Regression Analysis

Detect and handle/correct the following Issues in Linear Regression:

- Outliers

- Non Linear Relationships

- Multicollinearity

- Heteroskedasticity

- Serial Correlation (Autocorrelation)

### Logistic Regression

After getting all the answers about the Microsoft stock, we'll have a look at another dataset, create a Logistic Regression Model and determine the factors that significantly influenced the probability of surviving the Titanic disaster (1% level of significance).

### Getting, Cleaning and Preparing the Data

In [43]:
import pandas as pd

In [44]:
# loading the metadata of Kaggle's movies dataset,
# which can be found at https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset
# some columns contain mixed data types; ideally we would set a dtype for them
# low_memory=False reads the file in one pass, so it loads anyway (without setting dtypes)
movie = pd.read_csv("../assets/data/movies_metadata.csv", low_memory=False)

In [45]:
movie # looks quite messy

,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45461,False,,0,"[{'id': 18, 'name': 'Drama'}, {'id': 10751, 'n...",http://www.imdb.com/title/tt6209470/,439050,tt6209470,fa,رگ خواب,Rising and falling between a man and woman.,...,,0.0,90.0,"[{'iso_639_1': 'fa', 'name': 'فارسی'}]",Released,Rising and falling between a man and woman,Subdue,False,4.0,1.0
45462,False,,0,"[{'id': 18, 'name': 'Drama'}]",,111109,tt2028550,tl,Siglo ng Pagluluwal,An artist struggles to finish his work while a...,...,2011-11-17,0.0,360.0,"[{'iso_639_1': 'tl', 'name': ''}]",Released,,Century of Birthing,False,9.0,3.0
45463,False,,0,"[{'id': 28, 'name': 'Action'}, {'id': 18, 'nam...",,67758,tt0303758,en,Betrayal,"When one of her hits goes wrong, a professiona...",...,2003-08-01,0.0,90.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,A deadly game of wits.,Betrayal,False,3.8,6.0
45464,False,,0,[],,227506,tt0008536,en,Satana likuyushchiy,"In a small town live two brothers, one a minis...",...,1917-10-21,0.0,87.0,[],Released,,Satan Triumphant,False,0.0,0.0


In [46]:
movie.info() # let's have a look at the metadata

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 24 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   adult                  45466 non-null  object 
 1   belongs_to_collection  4494 non-null   object 
 2   budget                 45466 non-null  object 
 3   genres                 45466 non-null  object 
 4   homepage               7782 non-null   object 
 5   id                     45466 non-null  object 
 6   imdb_id                45449 non-null  object 
 7   original_language      45455 non-null  object 
 8   original_title         45466 non-null  object 
 9   overview               44512 non-null  object 
 10  popularity             45461 non-null  object 
 11  poster_path            45080 non-null  object 
 12  production_companies   45463 non-null  object 
 13  production_countries   45463 non-null  object 
 14  release_date           45379 non-null  object 
 15  re

In [47]:
# we want to use the release_date as the index
# so we convert it to datetime first
# errors="coerce" sets values that cannot be parsed to NaT (Not a Time)
pd.to_datetime(movie.release_date, errors="coerce")

0       1995-10-30
1       1995-12-15
2       1995-12-22
3       1995-12-22
4       1995-02-10
           ...    
45461          NaT
45462   2011-11-17
45463   2003-08-01
45464   1917-10-21
45465   2017-06-09
Name: release_date, Length: 45466, dtype: datetime64[ns]
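To see what `errors="coerce"` does in isolation, here is a tiny example on a hand-made Series. The values are hypothetical and only meant to show the behaviour: entries that cannot be parsed, as well as missing entries, come back as NaT instead of raising an error.

```python
import pandas as pd

# toy column with one unparseable entry (hypothetical values)
raw = pd.Series(["1995-10-30", "not a date", None, "2017-06-09"])

dates = pd.to_datetime(raw, errors="coerce")
print(dates)
print(dates.isna().sum())   # number of entries that became NaT
```

With the default `errors="raise"`, the same call would stop with an error at the unparseable entry.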

In [48]:
# now we can set the new index:
# first setting the converted dates as the index,
# then dropping the now redundant release_date column,
# and overwriting movie with the result
movie = movie.set_index(pd.to_datetime(movie.release_date, errors = "coerce")).drop(columns = ["release_date"])

In [49]:
# sorting the index; gives chronological order to dates
movie.sort_index(inplace = True)

In [50]:
movie

release_date,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,production_countries,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
1874-12-09,False,,0,"[{'id': 99, 'name': 'Documentary'}]",,315946,tt3155794,xx,Passage de Venus,Photo sequence of the rare transit of Venus ov...,...,"[{'iso_3166_1': 'FR', 'name': 'France'}]",0.0,1.0,"[{'iso_639_1': 'xx', 'name': 'No Language'}]",Released,,Passage of Venus,False,6.0,19.0
1878-06-14,False,,0,"[{'id': 99, 'name': 'Documentary'}]",,194079,tt2221420,en,Sallie Gardner at a Gallop,Sallie Gardner at a Gallop was one of the earl...,...,"[{'iso_3166_1': 'US', 'name': 'United States o...",0.0,1.0,"[{'iso_639_1': 'xx', 'name': 'No Language'}]",Released,,Sallie Gardner at a Gallop,False,6.2,25.0
1883-11-19,False,,0,"[{'id': 99, 'name': 'Documentary'}]",,426903,tt5459794,en,Buffalo Running,Individual photographs of the running of a buf...,...,"[{'iso_3166_1': 'US', 'name': 'United States o...",0.0,1.0,"[{'iso_639_1': 'xx', 'name': 'No Language'}]",Released,,Buffalo Running,False,5.4,7.0
1887-08-18,False,,0,"[{'id': 99, 'name': 'Documentary'}]",,159897,tt2075247,xx,Man Walking Around a Corner,The last remaining production of Le Prince's L...,...,"[{'iso_3166_1': 'US', 'name': 'United States o...",0.0,1.0,"[{'iso_639_1': 'xx', 'name': 'No Language'}]",Released,,Man Walking Around a Corner,False,4.1,17.0
1888-01-01,False,,0,"[{'id': 99, 'name': 'Documentary'}]",,96882,tt1758563,xx,Accordion Player,The last remaining film of Le Prince's LPCCP T...,...,"[{'iso_3166_1': 'GB', 'name': 'United Kingdom'}]",0.0,1.0,"[{'iso_639_1': 'xx', 'name': 'No Language'}]",Released,,Accordion Player,False,4.4,18.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
NaT,False,,0,[],,438910,tt0810384,ru,Konstruktor krasnogo tsveta -1993,Engineering Red - 1993 Dir: Andrey I. Y. Petr...,...,[],0.0,76.0,[],Released,,Engineering Red,False,6.0,2.0
NaT,False,,0,"[{'id': 9648, 'name': 'Mystery'}, {'id': 878, ...",,433711,tt3158690,en,All Superheroes Must Die 2: The Last Superhero,"In a no holds barred documentary, acclaimed jo...",...,[],0.0,74.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,All Superheroes Must Die 2: The Last Superhero,False,4.0,1.0
NaT,False,,0,[],,335251,tt1883368,en,The Land Where the Blues Began,An exploration of the musical and social origi...,...,[],0.0,0.0,[],Released,,The Land Where the Blues Began,False,0.0,0.0
NaT,False,,0,"[{'id': 18, 'name': 'Drama'}, {'id': 80, 'name...",,449131,tt0321264,ru,Aprel,,...,"[{'iso_3166_1': 'RU', 'name': 'Russia'}]",0.0,,[],Released,,Aprel,False,6.0,1.0


In [51]:
# for our analysis we only need certain columns
df = movie.loc[:, ["title", "budget", "revenue"]].copy() # saving them to a new DataFrame

In [52]:
df

release_date,title,budget,revenue
1874-12-09,Passage of Venus,0,0.0
1878-06-14,Sallie Gardner at a Gallop,0,0.0
1883-11-19,Buffalo Running,0,0.0
1887-08-18,Man Walking Around a Corner,0,0.0
1888-01-01,Accordion Player,0,0.0
...,...,...,...
NaT,Engineering Red,0,0.0
NaT,All Superheroes Must Die 2: The Last Superhero,0,0.0
NaT,The Land Where the Blues Began,0,0.0
NaT,Aprel,0,0.0


In [53]:
df.info() # shows us the budget column has mixed data types (object)

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 45466 entries, 1874-12-09 to NaT
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   title    45460 non-null  object 
 1   budget   45466 non-null  object 
 2   revenue  45460 non-null  float64
dtypes: float64(1), object(2)
memory usage: 1.4+ MB


In [54]:
# converting the datatype in the budget column to numeric
df.budget = pd.to_numeric(df.budget, errors="coerce") # errors="coerce" once again

In [55]:
df

release_date,title,budget,revenue
1874-12-09,Passage of Venus,0.0,0.0
1878-06-14,Sallie Gardner at a Gallop,0.0,0.0
1883-11-19,Buffalo Running,0.0,0.0
1887-08-18,Man Walking Around a Corner,0.0,0.0
1888-01-01,Accordion Player,0.0,0.0
...,...,...,...
NaT,Engineering Red,0.0,0.0
NaT,All Superheroes Must Die 2: The Last Superhero,0.0,0.0
NaT,The Land Where the Blues Began,0.0,0.0
NaT,Aprel,0.0,0.0


In [56]:
df.info() # now we have float64 for budget and revenue

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 45466 entries, 1874-12-09 to NaT
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   title    45460 non-null  object 
 1   budget   45463 non-null  float64
 2   revenue  45460 non-null  float64
dtypes: float64(2), object(1)
memory usage: 1.4+ MB
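The non-null count for budget dropped from 45466 to 45463, so three entries could not be converted and became NaN. The following sketch shows the same effect on a toy column and how the offending original values could be inspected. The strings used here are hypothetical, not the actual entries from the dataset.

```python
import pandas as pd

# toy budget column with one non-numeric entry (hypothetical values)
raw = pd.Series(["30000000", "0", "poster.jpg", "65000000"])

budget = pd.to_numeric(raw, errors="coerce")
print(budget)
print(raw[budget.isna()])   # the original entries that could not be converted
```

Looking at the rejected values before dropping them is a cheap check that nothing valuable is thrown away.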


In [57]:
df.describe() # summary statistics on the numerical columns
# shows we have large numbers in the millions

,budget,revenue
count,45463.0,45460.0
mean,4224579.0,11209350.0
std,17424130.0,64332250.0
min,0.0,0.0
25%,0.0,0.0
50%,0.0,0.0
75%,0.0,0.0
max,380000000.0,2787965000.0


In [58]:
# large amounts are easier to read in millions
# therefore we divide budget and revenue by one million
df.iloc[:, -2:] = df.iloc[:, -2:] / 1000000

In [59]:
df.describe() # budget and revenue are now in millions

,budget,revenue
count,45463.0,45460.0
mean,4.224579,11.209349
std,17.424133,64.332247
min,0.0,0.0
25%,0.0,0.0
50%,0.0,0.0
75%,0.0,0.0
max,380.0,2787.965087
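Selecting the last two columns by position works here, but it silently depends on the column order. As an alternative sketch (an assumption on my side, not the course's approach), the same scaling can be done by column name on a toy frame with made-up values:

```python
import pandas as pd

# toy frame with the same column layout (hypothetical values)
toy = pd.DataFrame({
    "title":   ["A", "B"],
    "budget":  [30_000_000.0, 65_000_000.0],
    "revenue": [373_554_033.0, 262_797_249.0],
})

# select the columns by name instead of by position, then scale to millions
money_cols = ["budget", "revenue"]
toy[money_cols] = toy[money_cols] / 1_000_000
print(toy)
```

The result is the same as with `iloc`, but it keeps working if columns are added or reordered later.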


In [60]:
df.loc[df.title.isna()] # filtering for all movies where we have missing values in the title column

release_date,title,budget,revenue
NaT,,0.0,
NaT,,,
NaT,,0.0,
NaT,,,
NaT,,0.0,
NaT,,,


In [61]:
df.dropna(inplace=True) # drop the rows with missing values
# inplace=True modifies df directly instead of returning a new DataFrame

In [62]:
df.info() # check if dropping missing values has been successful

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 45460 entries, 1874-12-09 to NaT
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   title    45460 non-null  object 
 1   budget   45460 non-null  float64
 2   revenue  45460 non-null  float64
dtypes: float64(2), object(1)
memory usage: 1.4+ MB
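A small sketch of how `dropna` decides which rows to remove, again on made-up data. By default a row is dropped as soon as any column is missing; the `subset` parameter restricts the check to selected columns.

```python
import numpy as np
import pandas as pd

# toy frame with missing values in different columns (hypothetical values)
toy = pd.DataFrame({
    "title":   ["A", None, "C"],
    "budget":  [1.0, 0.0, np.nan],
    "revenue": [2.0, np.nan, 3.0],
})

all_complete = toy.dropna()                  # drops every row with at least one missing value
titled_only = toy.dropna(subset=["title"])   # only requires a title
print(len(all_complete), len(titled_only))
```

Note that `dropna` only inspects column values, not the index, which is why the `info()` output above still shows an index ending in NaT.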


In [63]:
# further analysing the DataFrame
df.budget.value_counts()
# budget is zero for 36570 rows

budget
0.000000     36570
5.000000       286
10.000000      259
20.000000      243
2.000000       242
             ...  
0.050663         1
0.000762         1
0.033500         1
0.235000         1
4.696772         1
Name: count, Length: 1223, dtype: int64

In [64]:
df.revenue.value_counts()
# revenue is zero for 38052 rows

revenue
0.000000      38052
12.000000        20
11.000000        19
10.000000        19
2.000000         18
              ...  
189.198313        1
304.320254        1
1.929168          1
25.605015         1
10.893246         1
Name: count, Length: 6863, dtype: int64

In [65]:
# for our analysis we need the revenue and the budget being greater than zero
# so, filtering the dataFrame
df = df.loc[(df.revenue > 0) & (df.budget > 0)]

In [66]:
df

release_date,title,budget,revenue
1915-02-08,The Birth of a Nation,0.100000,11.000000
1915-12-13,The Cheat,0.017311,0.137365
1916-12-24,"20,000 Leagues Under the Sea",0.200000,8.000000
1918-08-01,Mickey,0.250000,8.000000
1921-01-21,The Kid,0.250000,2.500000
...,...,...,...
2017-07-26,Atomic Blonde,30.000000,90.007945
2017-07-28,The Emoji Movie,50.000000,66.913939
2017-08-03,The Dark Tower,60.000000,71.000000
2017-08-03,Wind River,11.000000,184.770205


In [67]:
df.info() # we're down to 5381 entries

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 5381 entries, 1915-02-08 to 2017-08-04
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   title    5381 non-null   object 
 1   budget   5381 non-null   float64
 2   revenue  5381 non-null   float64
dtypes: float64(2), object(1)
memory usage: 168.2+ KB
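The filter above combines two conditions. As a minimal sketch on a toy frame (values are made up), each comparison produces a boolean Series and `&` combines them element by element. The parentheses are required because `&` binds more tightly than `>` in Python.

```python
import pandas as pd

# toy frame where only one row has both a budget and a revenue (hypothetical values)
toy = pd.DataFrame({
    "title":   ["A", "B", "C", "D"],
    "budget":  [0.0, 5.0, 10.0, 0.0],
    "revenue": [0.0, 0.0, 25.0, 8.0],
})

# each comparison yields a boolean Series; & combines them element-wise
mask = (toy.revenue > 0) & (toy.budget > 0)
filtered = toy.loc[mask]
print(mask.sum(), len(toy))   # rows kept vs. rows before filtering
print(filtered)
```

Summing the mask before filtering is a quick way to see how many rows will survive.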


In [68]:
df.describe()

,budget,revenue
count,5381.0,5381.0
mean,31.094796,90.318123
std,40.162625,166.142264
min,1e-06,1e-06
25%,5.037,7.011317
50%,17.0,29.918745
75%,40.0,99.965753
max,380.0,2787.965087


In [69]:
df.sort_values("budget", ascending=False) # sorting by budget column from high to low
# gives us most expensive movie on top

release_date,title,budget,revenue
2011-05-14,Pirates of the Caribbean: On Stranger Tides,380.000000,1045.713802
2007-05-19,Pirates of the Caribbean: At World's End,300.000000,961.000000
2015-04-22,Avengers: Age of Ultron,280.000000,1405.403694
2006-06-28,Superman Returns,270.000000,391.081192
2012-03-07,John Carter,260.000000,284.139100
...,...,...,...
1987-11-06,Less Than Zero,0.000001,12.396383
2012-03-30,Aquí Entre Nos,0.000001,2.755584
1936-02-05,Modern Times,0.000001,8.500000
2003-08-15,Tere Naam,0.000001,0.000002


In [70]:
# same for revenue
df.sort_values("revenue", ascending = False)

release_date,title,budget,revenue
2009-12-10,Avatar,237.000000,2787.965087
2015-12-15,Star Wars: The Force Awakens,245.000000,2068.223624
1997-11-18,Titanic,200.000000,1845.034188
2012-04-25,The Avengers,220.000000,1519.557910
2015-06-09,Jurassic World,150.000000,1513.528810
...,...,...,...
2003-08-15,Tere Naam,0.000001,0.000002
1995-09-28,Mute Witness,0.000002,0.000001
1996-10-16,The Wind in the Willows,0.000012,0.000001
1925-08-26,The Merry Widow,0.000592,0.000001


In [71]:
# export our dataset to csv file
df.to_csv("../assets/data/bud_vs_rev.csv")
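When the exported file is loaded again, the dates come back as plain strings unless we ask pandas to parse them. The sketch below shows one way to restore the DatetimeIndex. It uses a tiny stand-in frame with made-up values and an in-memory buffer instead of the real file path, both of which are assumptions for illustration.

```python
import io
import pandas as pd

# small stand-in for the exported frame (hypothetical values)
toy = pd.DataFrame(
    {"title": ["A", "B"], "budget": [30.0, 50.0], "revenue": [90.0, 66.9]},
    index=pd.DatetimeIndex(["2017-07-26", "2017-07-28"], name="release_date"),
)

buffer = io.StringIO()   # in-memory file, used here instead of a path on disk
toy.to_csv(buffer)
buffer.seek(0)

# restore the DatetimeIndex while reading
restored = pd.read_csv(buffer, parse_dates=["release_date"], index_col="release_date")
print(restored.index.dtype)
```

With a real file, the buffer would simply be replaced by the path that was passed to `to_csv`.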