# Assignment 8

## Submit your .ipynb file to Gradescope by Tuesday, November 11th **by 10pm**

#### Import necessary libraries:

In [1]:
import statsmodels.formula.api as smf
import pandas as pd 

##### Run the code cell below to read the CSV file named `results.csv` in the `data` folder and print the first 5 rows of the dataset (using a quick alternative to `.iloc[:5,:]`). Browse the dataset. (We've seen this file before; it's part of the Formula One racing dataset.)

In [2]:
df_results = pd.read_csv("data/results.csv")
display(df_results.head())

Unnamed: 0,resultId,raceId,driverId,constructorId,number,grid,position,positionText,positionOrder,points,laps,time,milliseconds,fastestLap,rank,fastestLapTime,fastestLapSpeed,statusId
0,1,18,1,1,22,1,1,1,1,10.0,58,1:34:50.616,5690616,39,2,1:27.452,218.3,1
1,2,18,2,2,3,5,2,2,2,8.0,58,+5.478,5696094,41,3,1:27.739,217.586,1
2,3,18,3,3,7,7,3,3,3,6.0,58,+8.163,5698779,41,5,1:28.090,216.719,1
3,4,18,4,4,5,11,4,4,4,5.0,58,+17.181,5707797,58,7,1:28.603,215.464,1
4,5,18,5,1,23,3,5,5,5,4.0,58,+18.014,5708630,43,1,1:27.418,218.385,1


### (1) Linear Regression

Consider the data in `results.csv` loaded above. We might guess that there is a simple linear relationship between the number of points scored by a driver ("points"), the number of laps completed by the driver ("laps"), and the driver's starting position ("grid"). In particular, a better starting position and a larger number of completed laps should lead to a higher number of points.

Concretely, we expect that these three variables are related by the approximate equality:

$$ p_i \approx a\cdot \ell_i + b\cdot g_i + c$$

where:

- $i$ is the index of the result (row $i$ of the DataFrame)
- $p_i$ is the number of points scored in result $i$
- $\ell_i$ is the number of laps completed in result $i$
- $g_i$ is the starting "grid" position of result $i$.
- $a$, $b$, and $c$ are the coefficients of the linear model we need to determine.

In this model, we say that "laps" and "grid" are both **independent variables** and "points" is the **dependent variable**.

- ~~Use ``smf.ols`` (Ordinary Least Squares) to compute the 3 coefficients of the model by running the following line of code (the names "independent_var/dependent_var1/dependent_var2" are placeholder names - substitute the appropriate column names from ``df_results``)~~:
```python
        # typo in original!!
```

- Use ``smf.ols`` (Ordinary Least Squares) to compute the 3 coefficients of the model by running the following line of code (the names "dependent_var/independent_var1/independent_var2" are placeholder names - substitute the appropriate column names from ``df_results``):
```python
model = smf.ols(formula = "dependent_var ~ independent_var1  + independent_var2", data = df_results)
```

- Compute a Pandas ``Series`` containing the computed coefficients $a$, $b$, and $c$ from the linear model. You can do this as follows:
```python
coeffs = model.fit().params
```
- Assign the entries of ``coeffs`` to 3 floating-point variables ``a``, ``b``, ``c`` corresponding to the coefficients of the model.

- Define two variables ``grid_val = 3`` and ``laps_val = 55``, which represent the starting position (grid) and the number of completed laps, respectively, for a given driver. Using these values and the model coefficients, calculate how many points the linear model predicts for this driver, and assign the result to the variable ``predicted_points``.

In [None]:
# your answer here

model = smf.ols(formula="points ~ laps + grid", data=df_results)
coeffs = model.fit().params
a, b, c = coeffs["laps"], coeffs["grid"], coeffs["Intercept"]
grid_val = 3
laps_val = 55

predicted_points = a * laps_val + b * grid_val + c

print("Coefficients:")
print(f"a (laps coefficient): {a}")
print(f"b (grid coefficient): {b}")
print(f"c (intercept): {c}")
print(f"\nPredicted points for grid={grid_val}, laps={laps_val}: {predicted_points}")

Coefficients:
a (laps coefficient): 0.03927861016041546
b (grid coefficient): -0.2247957790519166
c (intercept): 2.5841263931430722

Predicted points for grid=3, laps=55: 4.070062614810173
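
As a cross-check on a hand-computed prediction, the fitted results object also provides a `predict` method. The sketch below is a minimal, self-contained illustration on a small made-up dataset: the `toy` DataFrame and its noiseless relationship `points = 2*laps - 3*grid + 5` are assumptions for illustration, not values from `results.csv`. It shows that the manual formula $a\cdot \ell + b\cdot g + c$ and `predict` agree.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical toy data (not from results.csv):
# points = 2*laps - 3*grid + 5 holds exactly, with no noise.
toy = pd.DataFrame({
    "laps": [50, 55, 60, 58, 40, 57],
    "grid": [1, 5, 3, 10, 20, 2],
})
toy["points"] = 2 * toy["laps"] - 3 * toy["grid"] + 5

# Fit once and keep the results object so it can be reused.
results = smf.ols(formula="points ~ laps + grid", data=toy).fit()
a = results.params["laps"]
b = results.params["grid"]
c = results.params["Intercept"]

# Manual prediction from the coefficients.
grid_val, laps_val = 3, 55
manual = a * laps_val + b * grid_val + c

# Same prediction via predict(), which expects the independent
# variables as columns of a DataFrame.
new_data = pd.DataFrame({"laps": [laps_val], "grid": [grid_val]})
via_predict = results.predict(new_data).iloc[0]

print(manual, via_predict)
```

Because the toy data is noiseless, both values come out at approximately 106 (that is, 2·55 − 3·3 + 5). With real data the two numbers should still match each other, which makes `predict` a convenient sanity check on the hand-written formula.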


##### Run the code cell below to read the CSV file named `imdb_top_1000.csv` in the `data` folder and print the first 5 rows of the dataset (using a quick alternative to `.iloc[:5,:]`). Browse the dataset. (These are the 1000 movies with the highest scores on imdb.com as of 5 years ago.)

##### We will use this dataset for the remaining problems.

In [8]:
df_movies = pd.read_csv("data/imdb_top_1000.csv")
display(df_movies.head())

Unnamed: 0,Name,Released_Year,Length,Genre,IMDB_Rating,Overview,Director,Star1,Star2,Gross
0,The Shawshank Redemption,1994,142 min,Drama,9.3,Two imprisoned men bond over a number of years...,Frank Darabont,Tim Robbins,Morgan Freeman,28341469
1,The Godfather,1972,175 min,"Crime, Drama",9.2,An organized crime dynasty's aging patriarch t...,Francis Ford Coppola,Marlon Brando,Al Pacino,134966411
2,The Dark Knight,2008,152 min,"Action, Crime, Drama",9.0,When the menace known as the Joker wreaks havo...,Christopher Nolan,Christian Bale,Heath Ledger,534858444
3,The Godfather: Part II,1974,202 min,"Crime, Drama",9.0,The early life and career of Vito Corleone in ...,Francis Ford Coppola,Al Pacino,Robert De Niro,57300000
4,12 Angry Men,1957,96 min,"Crime, Drama",9.0,A jury holdout attempts to prevent a miscarria...,Sidney Lumet,Henry Fonda,Lee J. Cobb,4360000


### (2) Groupby + Aggregate Statistics

Create a new DataFrame by grouping the "movies" dataset by **"Genre"** and then computing the following aggregate statistics for **"IMDB_Rating"**:

- **Mean**
- **Standard Deviation**
- **Minimum**
- **Maximum**

Finally, sort the DataFrame by the mean IMDB rating in **descending order**. Save the final sorted DataFrame as ``movies_agg``, and display (or print) it as output.
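
One thing to expect in the result: a genre that contains a single movie will show `NaN` for the standard deviation, because pandas computes the sample standard deviation (`ddof=1`) by default, which is undefined for one observation. The sketch below reproduces this on a small made-up DataFrame (`toy` and its values are hypothetical). It also uses named aggregation, an optional variant that yields more descriptive column names than `mean`/`std`/`min`/`max`.

```python
import pandas as pd

# Hypothetical toy data: two "Drama" rows and a single "Western" row.
toy = pd.DataFrame({
    "Genre": ["Drama", "Drama", "Western"],
    "IMDB_Rating": [9.0, 8.0, 8.8],
})

toy_agg = (
    toy
    .groupby("Genre")["IMDB_Rating"]
    # Named aggregation: output_column_name="aggregation function".
    .agg(rating_mean="mean", rating_std="std",
         rating_min="min", rating_max="max")
    .sort_values(by="rating_mean", ascending=False)
)

# "Western" has only one row, so its rating_std is NaN.
print(toy_agg)
```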

In [16]:
# your answer here
movies_agg = (
    df_movies
    .groupby("Genre")["IMDB_Rating"]
    .agg(["mean", "std", "min", "max"])  
    .sort_values(by="mean", ascending=False) 
)

display(movies_agg)


Genre,mean,std,min,max
"Animation, Drama, War",8.50,,8.5,8.5
"Drama, Musical",8.40,,8.4,8.4
"Action, Sci-Fi",8.40,0.360555,8.0,8.7
"Drama, Mystery, War",8.35,0.070711,8.3,8.4
Western,8.35,0.420317,7.8,8.8
...,...,...,...,...
"Adventure, Comedy, War",7.60,,7.6,7.6
"Animation, Comedy, Crime",7.60,,7.6,7.6
"Action, Adventure, Family",7.60,,7.6,7.6
"Animation, Drama, Romance",7.60,,7.6,7.6


### (3) Merge Data

Create a new DataFrame ``movies_merge`` by merging the aggregate information from (2) into ``df_movies``. Treat ``df_movies`` as both the primary dataset and the "left" dataset.

Display (or print) the resulting DataFrame.

In [11]:
# your answer here
movies_merge = pd.merge(
    df_movies,         
    movies_agg,        
    on="Genre",       
    how="left"       
)

display(movies_merge)


Unnamed: 0,Name,Released_Year,Length,Genre,IMDB_Rating,Overview,Director,Star1,Star2,Gross,mean,std,min,max
0,The Shawshank Redemption,1994,142 min,Drama,9.3,Two imprisoned men bond over a number of years...,Frank Darabont,Tim Robbins,Morgan Freeman,28341469,7.975294,0.305468,7.6,9.3
1,The Godfather,1972,175 min,"Crime, Drama",9.2,An organized crime dynasty's aging patriarch t...,Francis Ford Coppola,Marlon Brando,Al Pacino,134966411,8.157692,0.448262,7.6,9.2
2,The Dark Knight,2008,152 min,"Action, Crime, Drama",9.0,When the menace known as the Joker wreaks havo...,Christopher Nolan,Christian Bale,Heath Ledger,534858444,7.880000,0.316664,7.6,9.0
3,The Godfather: Part II,1974,202 min,"Crime, Drama",9.0,The early life and career of Vito Corleone in ...,Francis Ford Coppola,Al Pacino,Robert De Niro,57300000,8.157692,0.448262,7.6,9.2
4,12 Angry Men,1957,96 min,"Crime, Drama",9.0,A jury holdout attempts to prevent a miscarria...,Sidney Lumet,Henry Fonda,Lee J. Cobb,4360000,8.157692,0.448262,7.6,9.2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,Breakfast at Tiffany's,1961,115 min,"Comedy, Drama, Romance",7.6,A young New York socialite becomes interested ...,Blake Edwards,Audrey Hepburn,George Peppard,,7.877419,0.257824,7.6,8.6
996,Giant,1956,201 min,"Drama, Western",7.6,Sprawling epic covering the life of a Texas ca...,George Stevens,Elizabeth Taylor,Rock Hudson,,7.980000,0.363318,7.6,8.4
997,From Here to Eternity,1953,118 min,"Drama, Romance, War",7.6,"In Hawaii in 1941, a private is cruelly punish...",Fred Zinnemann,Burt Lancaster,Montgomery Clift,30500000,8.025000,0.368556,7.6,8.5
998,Lifeboat,1944,97 min,"Drama, War",7.6,Several survivors of a torpedoed merchant ship...,Alfred Hitchcock,Tallulah Bankhead,John Hodiak,,8.073333,0.276371,7.6,8.6
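
A left merge keeps every row of the left DataFrame, so the merged result should have as many rows as the left input. The sketch below illustrates this on made-up data (the `movies` and `agg` frames are hypothetical stand-ins, not the assignment data). It uses two optional `pd.merge` arguments: `validate="many_to_one"`, which raises an error if a genre appears more than once in the aggregate table, and `indicator=True`, which records whether each row found a match.

```python
import pandas as pd

# Hypothetical stand-in for the movies table.
movies = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Genre": ["Drama", "Western", "Drama"],
    "IMDB_Rating": [9.0, 8.8, 8.0],
})

# Per-genre statistics, with Genre moved from the index back to a column.
agg = (
    movies
    .groupby("Genre")["IMDB_Rating"]
    .agg(["mean", "max"])
    .reset_index()
)

merged = pd.merge(
    movies,                  # left (primary) dataset
    agg,                     # right dataset with one row per genre
    on="Genre",
    how="left",
    validate="many_to_one",  # fail loudly if agg has duplicate genres
    indicator=True,          # adds a "_merge" column: both / left_only
)

print(merged)
print(len(merged) == len(movies))
```

Checking the row count after a merge is a cheap way to catch accidental row duplication caused by non-unique keys on the right side.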


### (4) Renaming Columns

Create a new DataFrame ``movies_rename`` by renaming the following columns in ``df_movies``:

- "Name" --> **"Movie_Title"**
- "Length" --> **"Runtime"**
- "Gross" --> **"Revenue"**

Verify that this renaming worked by printing the column names of both the old DataFrame ("df_movies") and the new DataFrame ("movies_rename").

In [17]:
# your answer here
movies_rename = df_movies.rename(
    columns={
        "Name": "Movie_Title",
        "Length": "Runtime",
        "Gross": "Revenue"
    }
)

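print("Original df_movies columns:")
print(df_movies.columns.tolist())

print("\nRenamed movies_rename columns:")
print(movies_rename.columns.tolist())


Original df_movies columns:
['Name', 'Released_Year', 'Length', 'Genre', 'IMDB_Rating', 'Overview', 'Director', 'Star1', 'Star2', 'Gross']

Renamed movies_rename columns:
['Movie_Title', 'Released_Year', 'Runtime', 'Genre', 'IMDB_Rating', 'Overview', 'Director', 'Star1', 'Star2', 'Revenue']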
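
Two properties of `rename` are worth keeping in mind: it returns a new DataFrame and leaves the original untouched, and by default it silently ignores keys that do not match any column. The sketch below, on a made-up two-column frame (`toy` and the misspelled key `"Nmae"` are hypothetical), shows both behaviours along with the optional `errors="raise"` argument, which turns a misspelled column name into a `KeyError`.

```python
import pandas as pd

# Hypothetical two-column frame for illustration.
toy = pd.DataFrame({"Name": ["A"], "Gross": [100]})

# rename returns a new DataFrame; toy itself is unchanged.
renamed = toy.rename(columns={"Name": "Movie_Title", "Gross": "Revenue"})
print(toy.columns.tolist())
print(renamed.columns.tolist())

# A misspelled key is silently ignored by default...
unchanged = toy.rename(columns={"Nmae": "Movie_Title"})

# ...but errors="raise" reports it as a KeyError.
try:
    toy.rename(columns={"Nmae": "Movie_Title"}, errors="raise")
    typo_caught = False
except KeyError:
    typo_caught = True

print(typo_caught)
```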


### (5) Split Data

Use ".query()" to split the ``movies_rename`` dataset from (4) into different parts:

- Those directed by **David Lynch**, saved to a DataFrame called ``movies_lynch``
- Those directed by **Stanley Kubrick**, saved to a DataFrame called ``movies_kubrick``

Display/print both DataFrames to the screen.

In [18]:
# your answer here

movies_lynch = movies_rename.query('Director == "David Lynch"')
movies_kubrick = movies_rename.query('Director == "Stanley Kubrick"')

display(movies_lynch)
display(movies_kubrick)


Unnamed: 0,Movie_Title,Released_Year,Runtime,Genre,IMDB_Rating,Overview,Director,Star1,Star2,Revenue
276,The Elephant Man,1980,124 min,"Biography, Drama",8.1,A Victorian surgeon rescues a heavily disfigur...,David Lynch,Anthony Hopkins,John Hurt,
385,The Straight Story,1999,112 min,"Biography, Drama",8.0,An old man makes a long journey by lawnmower t...,David Lynch,Richard Farnsworth,Sissy Spacek,6203044.0
515,Mulholland Dr.,2001,147 min,"Drama, Mystery, Thriller",7.9,After a car wreck on the winding Mulholland Dr...,David Lynch,Naomi Watts,Laura Harring,7220243.0
834,Blue Velvet,1986,120 min,"Drama, Mystery, Thriller",7.7,The discovery of a severed human ear found in ...,David Lynch,Isabella Rossellini,Kyle MacLachlan,8551228.0
961,Lost Highway,1997,134 min,"Mystery, Thriller",7.6,Anonymous videotapes presage a musician's murd...,David Lynch,Bill Pullman,Patricia Arquette,3796699.0


Unnamed: 0,Movie_Title,Released_Year,Runtime,Genre,IMDB_Rating,Overview,Director,Star1,Star2,Revenue
73,The Shining,1980,146 min,"Drama, Horror",8.4,A family heads to an isolated hotel for the wi...,Stanley Kubrick,Jack Nicholson,Shelley Duvall,44017374.0
78,Dr. Strangelove or: How I Learned to Stop Worr...,1964,95 min,Comedy,8.4,An insane general triggers a path to nuclear h...,Stanley Kubrick,Peter Sellers,George C. Scott,275902.0
80,Paths of Glory,1957,88 min,"Drama, War",8.4,"After refusing to attack an enemy position, a ...",Stanley Kubrick,Kirk Douglas,Ralph Meeker,
104,Full Metal Jacket,1987,116 min,"Drama, War",8.3,A pragmatic U.S. Marine observes the dehumaniz...,Stanley Kubrick,Matthew Modine,R. Lee Ermey,46357676.0
113,A Clockwork Orange,1971,136 min,"Crime, Drama, Sci-Fi",8.3,"In the future, a sadistic gang leader is impri...",Stanley Kubrick,Malcolm McDowell,Patrick Magee,6207725.0
114,2001: A Space Odyssey,1968,149 min,"Adventure, Sci-Fi",8.3,After discovering a mysterious artifact buried...,Stanley Kubrick,Keir Dullea,Gary Lockwood,56954992.0
281,Barry Lyndon,1975,185 min,"Adventure, Drama, History",8.1,An Irish rogue wins the heart of a rich widow ...,Stanley Kubrick,Ryan O'Neal,Marisa Berenson,
441,The Killing,1956,84 min,"Crime, Drama, Film-Noir",8.0,Crook Johnny Clay assembles a five man team to...,Stanley Kubrick,Sterling Hayden,Coleen Gray,
549,Spartacus,1960,197 min,"Adventure, Biography, Drama",7.9,The slave Spartacus leads a violent revolt aga...,Stanley Kubrick,Kirk Douglas,Laurence Olivier,30000000.0
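
`.query()` can also reference Python variables with the `@` prefix, which avoids writing a separate quoted string for each director. The sketch below uses a made-up frame (`toy` and the helper `films_by` are hypothetical names introduced for illustration) and checks that the result matches ordinary boolean-mask filtering.

```python
import pandas as pd

# Hypothetical stand-in for movies_rename.
toy = pd.DataFrame({
    "Movie_Title": ["Blue Velvet", "The Shining", "Spartacus"],
    "Director": ["David Lynch", "Stanley Kubrick", "Stanley Kubrick"],
})

def films_by(df, director):
    """Return the rows of df whose Director column equals `director`."""
    # @director refers to the local Python variable, not to a column.
    return df.query("Director == @director")

toy_lynch = films_by(toy, "David Lynch")
toy_kubrick = films_by(toy, "Stanley Kubrick")

# Equivalent boolean-mask filtering, for comparison.
mask_kubrick = toy[toy["Director"] == "Stanley Kubrick"]

print(toy_lynch)
print(toy_kubrick.equals(mask_kubrick))
```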
