# CAPSTONE
___
## Movies
___
### Data Sets
* IMDB
* Rotten Tomatoes
* Metacritic
### Questions
* What makes a movie "good"?
    * Domestic box office income, adjusted for inflation
    * Budget
    * Critic/User Ratings
    * Common Actors/Actresses/Directors
    * Awards
* Compare the worst-rated movies that made the most money with the highest-rated movies that made the least.
* Look at how some of my favorite movies compare to the list (The Departed, Interstellar, Gladiator, Molly's Game).
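
One possible way to approach the low-rated/high-earning versus high-rated/low-earning comparison is to split each column at its median. This is a minimal sketch, assuming a hypothetical table with `title`, `rating`, and `gross` columns; every value below is an invented placeholder, not real data.

```python
import pandas as pd

# Toy data, invented for illustration only; not real ratings or grosses.
movies = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E", "F"],
    "rating": [3.1, 8.7, 4.0, 9.0, 7.5, 2.5],
    "gross": [500, 20, 300, 10, 250, 5],
})

# Split each column at its median.
low_rated = movies["rating"] < movies["rating"].median()
high_gross = movies["gross"] > movies["gross"].median()

# Poorly rated but high earning, largest gross first.
low_rated_hits = movies[low_rated & high_gross].sort_values("gross", ascending=False)

# Highly rated but low earning, smallest gross first.
high_rated_misses = movies[~low_rated & ~high_gross].sort_values("gross")

print(low_rated_hits)
print(high_rated_misses)
```

The median is only one possible cut-off; quantiles such as the top and bottom 10% would give a stricter comparison.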
___
## Golf
___
### Data Sets
* PGA
* LPGA
### Questions
* How has the improvement in equipment advanced the game?
    * Golf ball Comparison
    * Clubs
    * Spin rates
* The Tiger Woods effect on the game.
* Look at how much scores have dropped over the years.
* Does driving distance equate to lower scores?
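
The driving distance question could start with a simple correlation. This is a rough sketch, assuming a hypothetical per-player table with `driving_distance` and `scoring_average` columns; the numbers are made up and do not come from PGA or LPGA data.

```python
import pandas as pd

# Made-up numbers for illustration only; not real tour statistics.
players = pd.DataFrame({
    "driving_distance": [280, 290, 300, 310, 320],
    "scoring_average": [71.5, 71.0, 70.8, 70.1, 69.9],
})

# Pearson correlation: a negative value means longer drives go with lower scores.
corr = players["driving_distance"].corr(players["scoring_average"])
print(f"correlation: {corr:.2f}")
```

A correlation alone would not show that distance causes lower scores, so it is best treated as a first look.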


In [None]:
# Clean/Filter data to find US based shows/movies only

import pandas as pd

# Online Cleaned data location
# usmovies = pd.read_csv("https://raw.githubusercontent.com/natep514/Capstone/main/Resources/USMovies.csv", delimiter="\t")

# macOS file location
# usmovies = pd.read_csv("/Users/natepatten/code/savvycoders/Resources/title.akas.tsv", delimiter="\t")

# Windows file location
usmovies = pd.read_csv(r"C:\Users\Nate\code\savvycoders\Resources\title.akas.tsv", delimiter="\t", low_memory=False)

# Drop columns that are not needed
usmovies.drop(["ordering","isOriginalTitle", "types", "language", "attributes"], axis=1, inplace=True)

# Replace \N with 0
usmovies.replace('\\N', "0", inplace=True)

# Filter to get only US based movies/shows.
usmovies = usmovies[usmovies["region"] == "US"]

# Drop column region as it is no longer needed.
usmovies.drop("region", axis=1, inplace=True)

# Save the filtered data to a CSV file (uncomment if desired)
# usmovies.to_csv(r"C:\Users\Nate\code\savvycoders\Capstone\Resources\USMovies.csv", sep="\t")

# Test print results
print(usmovies.head(12))
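
The full `title.akas.tsv` file is large, so an alternative worth considering is reading it in chunks and keeping only the US rows as they arrive. The sketch below is not the method used above; `read_us_titles` is a hypothetical helper, the sample rows are invented, and the `titleId` and `title` column names are assumed. Setting `keep_default_na=False` is a deliberate choice here: by default pandas reads the text `NA` as missing, which would also swallow a region code spelled `NA`.

```python
import io

import pandas as pd

# Tiny stand-in for title.akas.tsv; the rows are invented for illustration.
sample_tsv = (
    "titleId\tordering\ttitle\tregion\n"
    "tt0000001\t1\tExample One\tUS\n"
    "tt0000001\t2\tExample One (alt)\tNA\n"
    "tt0000002\t1\tExample Two\tUS\n"
    "tt0000003\t1\tExample Three\t\\N\n"
)


def read_us_titles(source, chunksize=2):
    """Read a title.akas-style TSV in chunks and keep only the US rows."""
    kept = []
    reader = pd.read_csv(
        source,
        sep="\t",
        usecols=["titleId", "title", "region"],  # skip columns that get dropped anyway
        na_values="\\N",        # treat only the \N marker as missing
        keep_default_na=False,  # so a literal "NA" region code stays a string
        chunksize=chunksize,
    )
    for chunk in reader:
        kept.append(chunk[chunk["region"] == "US"])
    return pd.concat(kept, ignore_index=True).drop(columns="region")


us_titles = read_us_titles(io.StringIO(sample_tsv))
print(us_titles)
```

For the real file, a much larger `chunksize` (for example, a few hundred thousand rows) would be more practical.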



In [None]:
# Clean/Filter data to find Movies between 2000 & 2020

import pandas as pd

# Online Cleaned data location
# moviesfilter = pd.read_csv("https://raw.githubusercontent.com/natep514/Capstone/main/Resources/MoviesFilter.csv", delimiter="\t")

# MacOS File Location
# moviesfilter = pd.read_csv("/Users/natepatten/code/savvycoders/Resources/title.basics.tsv", delimiter="\t")

# Windows File Location
moviesfilter = pd.read_csv(r"C:\Users\Nate\code\savvycoders\Resources\title.basics.tsv", delimiter="\t", index_col=0, low_memory=False)

# Replace \N with 0
moviesfilter.replace("\\N", "0", inplace=True)

# Convert 'startYear' & 'isAdult' columns to integer dtype
moviesfilter[["startYear", "isAdult"]] = moviesfilter[["startYear", "isAdult"]].astype(int)

# Select movies between 2000 & 2020 that aren't adult
moviesfilter = moviesfilter[moviesfilter["titleType"] == "movie"]
moviesfilter = moviesfilter[moviesfilter["startYear"] >= 2000] 
moviesfilter = moviesfilter[moviesfilter["startYear"] <= 2020]
moviesfilter = moviesfilter[moviesfilter["isAdult"] == 0]

# Drop columns that are not needed
moviesfilter.drop(['originalTitle','isAdult', 'titleType', 'endYear'], axis=1, inplace=True)

# Save the filtered data to a CSV file (uncomment if desired)
# moviesfilter.to_csv(r"C:\Users\Nate\code\savvycoders\Capstone\Resources\MoviesFilter.csv", sep="\t")

# Test print results
print(moviesfilter.head(12))
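
Replacing `\N` with `"0"` makes a missing year indistinguishable from a year of 0. An alternative, sketched below on invented rows, is to coerce the columns with `pd.to_numeric` so missing values become `NaN`, then apply all four conditions as one boolean mask. The frame name `basics` and its contents are placeholders.

```python
import pandas as pd

# Invented rows shaped like title.basics.tsv.
basics = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3", "tt4"],
    "titleType": ["movie", "movie", "short", "movie"],
    "startYear": ["2005", "\\N", "2010", "1999"],
    "isAdult": ["0", "0", "0", "1"],
}).set_index("tconst")

# Coerce to numbers; "\N" and any other non-numeric text become NaN.
start_year = pd.to_numeric(basics["startYear"], errors="coerce")
is_adult = pd.to_numeric(basics["isAdult"], errors="coerce")

# One combined mask instead of four successive reassignments.
mask = (
    (basics["titleType"] == "movie")
    & start_year.between(2000, 2020)  # inclusive on both ends; NaN never matches
    & (is_adult == 0)
)
filtered = basics[mask]
print(filtered)
```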

In [None]:
# Clean/Filter data to find movie ratings

import pandas as pd

# Online Cleaned data location
# imdbratings = pd.read_csv("https://raw.githubusercontent.com/natep514/Capstone/main/Resources/imdbratings.csv", delimiter="\t")

# MacOS File Location
# imdbratings = pd.read_csv("/Users/natepatten/code/savvycoders/Resources/title.ratings.tsv", delimiter="\t")

# Windows File Location
imdbratings = pd.read_csv(r'C:\Users\Nate\code\savvycoders\Resources\title.ratings.tsv', delimiter="\t")

# Save the filtered data to a CSV file (uncomment if desired)
# imdbratings.to_csv(r"C:\Users\Nate\code\savvycoders\Capstone\Resources\imdbratings.csv", sep="\t")

# Test print results
print(imdbratings.head(12))
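
A likely next step is joining the three cleaned tables on the IMDb title ID. The sketch below shows one way to do that on invented stand-in frames; the `_sample` names are hypothetical, and the `titleId`, `primaryTitle`, `averageRating`, and `numVotes` columns are assumed to match the real files.

```python
import pandas as pd

# Invented rows standing in for the three cleaned tables.
us_sample = pd.DataFrame({
    "titleId": ["tt1", "tt2", "tt2"],
    "title": ["One", "Two", "Two (alt)"],
})
basics_sample = pd.DataFrame(
    {"primaryTitle": ["One", "Two", "Three"], "startYear": [2005, 2010, 2015]},
    index=pd.Index(["tt1", "tt2", "tt3"], name="tconst"),
)
ratings_sample = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt4"],
    "averageRating": [7.1, 6.4, 8.0],
    "numVotes": [1200, 800, 50],
})

# IDs with at least one US entry; the same ID can appear more than once.
us_ids = us_sample["titleId"].unique()

# Keep US movies only, then attach ratings; an inner join drops unrated titles.
movies_rated = (
    basics_sample[basics_sample.index.isin(us_ids)]
    .reset_index()
    .merge(ratings_sample, on="tconst", how="inner")
)
print(movies_rated)
```

Filtering by `isin` rather than merging against the US table avoids duplicating a movie once per alternate title.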

In [None]:
# Table scraper loop

import pandas as pd

# Scrape the yearly box office table for each year in the range
for year in range(2021, 2023):
    url = f'https://www.boxofficemojo.com/year/{year}/?grossesOption=calendarGrosses'
    scraper = pd.read_html(url)
    # Save the first table on the page to a CSV file
    scraper[0].to_csv(f'BoxOffice{year}.csv', index=False)

# View the table from the last year scraped
print(scraper[0])
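
The loop above stops at the first page that fails to load, and the saved tables do not record which year they came from. The sketch below is one possible variation, not a replacement: `scrape_years` is a hypothetical helper that skips failed years and adds a `Year` column. The demonstration uses a stand-in reader with invented rows so that no network request is made.

```python
import pandas as pd

URL_TEMPLATE = "https://www.boxofficemojo.com/year/{year}/?grossesOption=calendarGrosses"


def scrape_years(years, read_tables=pd.read_html):
    """Return {year: first table on the page}, skipping years that fail to load."""
    tables = {}
    for year in years:
        try:
            found = read_tables(URL_TEMPLATE.format(year=year))
        except (ValueError, OSError) as exc:  # no tables found, or a network error
            print(f"Skipping {year}: {exc}")
            continue
        tables[year] = found[0].assign(Year=year)  # record the year on every row
    return tables


# Stand-in reader with invented rows, used here instead of a live request.
def fake_reader(url):
    return [pd.DataFrame({"Release": ["Example Film"], "Total Gross": ["$1,000"]})]


demo = scrape_years(range(2021, 2023), read_tables=fake_reader)
print(demo[2021])
```

Calling `scrape_years(range(2021, 2023))` without the second argument would use `pd.read_html` against the live site.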


In [None]:
# Save all scraped csv's into a single file

# Import modules
import pandas as pd
import glob
import os

# Path of files; selects every file whose name starts with BoxOffice
files = os.path.join('C:\\Users\\Nate\\code\\savvycoders\\Capstone\\Resources\\', 'BoxOffice*.csv')
file = glob.glob(files)
df = pd.concat(map(pd.read_csv, file), ignore_index=True)

# Save the combined data to a tab-separated CSV file
df.to_csv(r'C:\Users\Nate\code\savvycoders\Capstone\Resources\BoxOfficeAll.csv', sep='\t')
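
One thing to watch: the `BoxOffice*.csv` pattern also matches `BoxOfficeAll.csv` once it has been written to the same folder, so a second run could pull the combined file back in. The sketch below shows one way to guard against that by accepting only file names of the form `BoxOffice<year>.csv`, and it tags each row with the year taken from the file name. `combine_box_office` is a hypothetical helper, and the demonstration writes invented rows to a temporary folder.

```python
import re
import tempfile
from pathlib import Path

import pandas as pd


def combine_box_office(folder):
    """Concatenate BoxOffice<year>.csv files in `folder`, tagging rows with the year."""
    frames = []
    for path in sorted(Path(folder).glob("BoxOffice*.csv")):
        match = re.fullmatch(r"BoxOffice(\d{4})", path.stem)
        if match is None:  # skips outputs such as BoxOfficeAll.csv
            continue
        frames.append(pd.read_csv(path).assign(Year=int(match.group(1))))
    return pd.concat(frames, ignore_index=True)


# Demonstration on invented files in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    for year, gross in [(2021, "$10"), (2022, "$20")]:
        pd.DataFrame({"Release": [f"Film {year}"], "Total Gross": [gross]}).to_csv(
            Path(tmp) / f"BoxOffice{year}.csv", index=False
        )
    # A previously combined file that should be ignored.
    pd.DataFrame({"Release": ["old"]}).to_csv(Path(tmp) / "BoxOfficeAll.csv", index=False)
    box_office_all = combine_box_office(tmp)

print(box_office_all)
```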


In [None]:
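# Clean/Filter the combined box office data

import pandas as pd

# Load the combined box office data saved above
BoxOfficeFilter = pd.read_csv(r'C:\Users\Nate\code\savvycoders\Capstone\Resources\BoxOfficeAll.csv', delimiter='\t', index_col=[0,1])

# Drop columns that are not needed
BoxOfficeFilter.drop(['Rank', 'Genre', 'Budget', 'Running Time','Release Date', 'Theaters', 'Estimated', 'Gross'], axis=1, inplace=True)

# Rename 'Total Gross' to 'Gross' now that the original 'Gross' column is gone
BoxOfficeFilter.rename(columns={"Total Gross": "Gross"}, inplace=True)

# Strip '$' and ',' from 'Gross' and convert to integer dtype (raw string avoids an invalid escape warning)
BoxOfficeFilter['Gross'] = BoxOfficeFilter['Gross'].replace({r'\$': '', ',': ''}, regex=True).astype(int)

# Save the filtered data to a CSV file
BoxOfficeFilter.to_csv(r'C:\Users\Nate\code\savvycoders\Capstone\Resources\BoxOfficeFilter.csv', sep='\t')

# Test print results
print(BoxOfficeFilter.head(12))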
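
`astype(int)` raises an error if any gross value is missing or still contains non-numeric text after the symbols are stripped. A more forgiving variant, sketched below on invented values, coerces such entries to missing and stores the result in pandas' nullable `Int64` dtype. The `"-"` placeholder is an assumption about how a missing gross might appear in the scraped table.

```python
import pandas as pd

# Invented values in the "$1,234" style, plus an assumed "-" placeholder.
gross_text = pd.Series(["$1,234,567", "$89", "-"])

# Strip "$" and "," then coerce; anything still non-numeric becomes NaN.
gross = pd.to_numeric(
    gross_text.str.replace(r"[$,]", "", regex=True),
    errors="coerce",
).astype("Int64")  # nullable integer dtype keeps missing values as <NA>

print(gross)
```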