## Skating Score Project

### Introduction: 

   Historically, figure skating has been a sport in which the United States has excelled at the Olympic Winter Games. Sports judged on both technical merit and artistic expression are difficult to judge objectively, and figure skating is no exception. In 2004, the previous, highly subjective scoring system was replaced with the International Judging System (IJS), which accounts for the minutiae of every skating program and awards specific point values based on multiple calculations. Since this change, the international results of women skaters representing the USA have declined drastically.

   The 2022 Olympic Winter Games in Beijing marked the fourth consecutive Olympics in which the US women were not awarded a medal. Many critics of the current state of international figure skating suggest that the medal drought is directly related to Russian dominance in women's figure skating. Russian figure skating has been under scrutiny for its questionable training tactics, and with the recent doping scandal at the 2022 Beijing Olympics, it is more than reasonable for all other skating federations to reject those tactics and put the wellbeing of athletes ahead of competitive victories. However, can the 15-year medal drought be attributed entirely to this?

   In each of the four most recent Winter Olympic Games, at least one of the women's figure skating medals has gone to an athlete from Japan, South Korea, Italy, or Canada. It is also worth noting that the United States continues to excel in most other figure skating disciplines (especially men's and ice dancing). This project is a data-driven analysis of this question, in which I explore trends that may offer insights for improving US women's figure skating scores at the Olympic Games.

### Project Goals:

- Construct a machine learning regression model that improves on baseline predictions of Olympic scores for women figure skaters under the International Judging System (implemented in 2004).
- Find the key drivers of Olympic event scores by analyzing athletes' competition data from before their Olympic performances.
- Empower US figure skating athletes and coaches with information that may lead to positive training modifications.
- Thoroughly document the process and key findings.
- Demonstrate the potential of the data science pipeline to better the sport of figure skating.
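
To make the first goal concrete, here is a minimal sketch of how "improving on a baseline" can be measured with RMSE. It is not the project's actual model: the data is synthetic, `avg_prior_score` is a hypothetical feature name, and the fit is a plain least-squares line evaluated in-sample for brevity.

```python
import numpy as np

# Synthetic stand-in data (NOT real skating scores), for illustration only.
rng = np.random.default_rng(123)
avg_prior_score = rng.uniform(150, 230, size=60)  # hypothetical feature
oly_event_score = 0.9 * avg_prior_score + 15 + rng.normal(0, 8, size=60)  # target


def rmse(actual, predicted):
    """Root mean squared error between two equal-length arrays."""
    diff = np.asarray(actual) - np.asarray(predicted)
    return float(np.sqrt(np.mean(diff ** 2)))


# Baseline: predict the mean target value for every skater.
baseline_pred = np.full_like(oly_event_score, oly_event_score.mean())
baseline_rmse = rmse(oly_event_score, baseline_pred)

# Simple model: degree-1 least-squares fit of target on the single feature.
slope, intercept = np.polyfit(avg_prior_score, oly_event_score, deg=1)
model_rmse = rmse(oly_event_score, slope * avg_prior_score + intercept)

# Percent reduction in RMSE relative to the baseline.
improvement_pct = 100 * (baseline_rmse - model_rmse) / baseline_rmse
print(f"baseline RMSE: {baseline_rmse:.2f}")
print(f"model RMSE:    {model_rmse:.2f}")
print(f"improvement:   {improvement_pct:.1f}%")
```

In practice the comparison would be made on validation and test splits rather than on the training data, so that the reported improvement reflects performance on unseen records.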

### Summary of Findings & Recommendations:
- My analysis indicates that the top drivers of tax assessed home values are:
     > - property size (square footage)
     > - property's year built
     > - bedroom count
     > - bathroom count

- I built and trained a Polynomial Regression model that improves predicted tax assessed home values by ~$80,000 (a 22% improvement over baseline/previous predictions).
 
- Because this model uses the top drivers of tax assessed home values, I can recommend employing it with reasonable confidence.

### Data Acquisition & Preparation
- Import necessary libraries
- Import user-defined functions (acquire.py, prepare.py)
- Data imported meets the following conditions:
    > - Single family homes in Orange county, CA, Ventura, CA, or Los Angeles, CA
    > - Had a transaction in 2017
    > - Data available in zillow properties_2017 table

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import requests
import acquire
import prepare
import os
from sklearn.model_selection import train_test_split

import warnings
warnings.filterwarnings("ignore")

from math import sqrt
from scipy import stats

import sklearn.preprocessing
from scipy.stats import pearsonr, spearmanr

from sklearn.cluster import DBSCAN
from sklearn.cluster import KMeans

from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression, LassoLars, TweedieRegressor
from sklearn.preprocessing import PolynomialFeatures
from sklearn.impute import SimpleImputer

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', 300)

from IPython.display import Markdown, display

np.random.seed(123)

In [7]:
df = acquire.get_competition_data()
# this is a user-defined function in acquire.py that pulls in selected data from skatingscore.com

In [8]:
df.info()
# shows a snapshot of all data/columns that may potentially be used prior to data wrangling.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1860 entries, 0 to 1859
Data columns (total 41 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Unnamed: 0   1860 non-null   int64  
 1   Skater       1860 non-null   object 
 2   Nat_x        1860 non-null   object 
 3   SP           1103 non-null   object 
 4   SP.1         1860 non-null   float64
 5   SP.2         1860 non-null   float64
 6   FS           1065 non-null   object 
 7   FS.1         1858 non-null   float64
 8   FS.2         1858 non-null   float64
 9   Total        1085 non-null   object 
 10  Total.1      1860 non-null   object 
 11  season_x     1860 non-null   int64  
 12  skater_name  1860 non-null   object 
 13  first_name   1860 non-null   object 
 14  #_x          1860 non-null   float64
 15  Nat_y        1860 non-null   object 
 16  Combo Jump   1860 non-null   object 
 17  Solo Jump    1859 non-null   object 
 18  Axel         1860 non-null   object 
 19  TES_x 

113.64    1
110.64    1
76.00     1
64.86     1
84.64     1
93.04     1
82.86     1
102.56    1
101.74    1
102.72    1
117.12    1
111.14    1
97.68     1
78.18     1
77.54     1
72.34     1
81.64     1
89.98     1
89.32     1
91.64     1
94.60     1
94.76     1
97.56     1
77.88     1
Name: QB, dtype: int64

### Initial Data:
I acquired data on international-level women figure skaters from 2004 onward. In the next steps, I label columns more clearly so that readers without domain knowledge can follow this analysis more easily. TES (technical element score), PCS (program component score), TSS (total segment score), etc. are abbreviations for parts of the skating score, and because I joined the acquired data together, there are many duplicate columns. Because the target variable is an athlete's final Olympic score ("oly_event_score"), I filter out all records that do not belong to an Olympian. After this, 102 records remain. Each record represents a skater's Olympic results from one of the five Winter Olympic Games between 2006 and 2022, along with information about that skater's international competitive history in the 4 years preceding that Olympics. Note that skater names can repeat in the dataset, because some skaters have competed at more than one Olympics.

Only major international events are included at this time, as these are likely to be the most representative of how a skater may perform in the high-pressure environment of the Olympics. The exploration and modeling process could be expanded in the future by adding skaters' scores from national-level competitions and additional international events.
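
The filtering step described above can be sketched as follows. This is an illustration only, not the contents of `prepare.py`: the rows are made up, and `olympic_year` is a hypothetical column name used to show why skater names can repeat while records stay unique.

```python
import pandas as pd

# Toy rows for illustration only; names and values are made up.
raw = pd.DataFrame({
    "skater_name": ["Skater A", "Skater A", "Skater B", "Skater C"],
    "olympic_year": [2018, 2022, 2018, 2022],  # hypothetical column
    "oly_event_score": [201.5, 214.3, None, 188.0],
})

# Keep only records that have an Olympic result (the target variable).
olympians = raw.dropna(subset=["oly_event_score"]).reset_index(drop=True)

# A skater can appear once per Olympics, so names may repeat
# while (skater_name, olympic_year) pairs remain unique.
repeat_names = bool(olympians["skater_name"].duplicated().any())
unique_records = not olympians.duplicated(
    subset=["skater_name", "olympic_year"]
).any()

print(olympians)
print("repeated skater names:", repeat_names)
print("one record per skater per Olympics:", unique_records)
```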

In [5]:
df = prepare.prepare_competition_data(df)

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 102 entries, 0 to 21
Data columns (total 40 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   skater_name                 102 non-null    object 
 1   event_final_place           102 non-null    float64
 2   short_score                 102 non-null    float64
 3   short_place                 102 non-null    float64
 4   free_score                  102 non-null    float64
 5   free_place                  102 non-null    float64
 6   event_score                 102 non-null    float64
 7   short_elements_score        102 non-null    float64
 8   short_elements_rank         102 non-null    float64
 9   short_components_score      102 non-null    float64
 10  short_components_rank       102 non-null    float64
 11  free_elements_score         102 non-null    float64
 12  free_elements_rank          102 non-null    float64
 13  free_components_score       102 non-

### In this project, events included are:

- Grand Prix Qualifiers (America, Canada, France, Japan, Russia, China)
- Grand Prix Final
- World Championships (because the World Championships take place after the Olympics in a given season, World Championship scores are counted toward the following season).
- Olympic Winter Games (event-level competition data is included for exploration only. The modeling goal is to predict Olympic scores from competitive history, so all Olympic data aside from the target variable is dropped before modeling).
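
The season adjustment for the World Championships can be sketched like this. The column names `event`, `season`, and `effective_season` are assumptions for illustration, as is encoding a season by a single integer year; the project's own data may be laid out differently.

```python
import pandas as pd

# Toy event log for one skater in one season; values are made up.
events = pd.DataFrame({
    "skater_name": ["Skater A"] * 3,
    "event": ["Grand Prix Final", "Olympic Winter Games", "World Championships"],
    "season": [2018, 2018, 2018],  # hypothetical integer season label
    "event_score": [205.0, 210.0, 208.0],
})

# Worlds are held after the Olympics, so count them toward the next season.
is_worlds = events["event"] == "World Championships"
events["effective_season"] = events["season"] + is_worlds.astype(int)

print(events[["event", "season", "effective_season"]])
```

Shifting Worlds forward keeps post-Olympic results out of the history used to predict that same Olympics, while still letting them inform the next Olympic cycle.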

### Non-Olympic scores in the 4-year period preceding the Olympics have been averaged together for each record in the following categories:

- short program/free program/final event place
- short program/free program/final event score
- short program/free program/final components score
- short program/free program/final elements score
- average errors, including deductions (falls and/or major errors), under-rotated jumps, costly jump errors, major combination jump errors, jump downgrades, illegal elements, suspected errors, and all jump errors
- average difficult jumping elements, including quads, triple Axels, and triple-triple combinations (note: this counts attempted jumps only)
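
The averaging described above can be sketched as a windowed mean per skater. This is a hedged illustration rather than the project's actual preparation code: the `average_prior_results` helper, the `falls` column, and the choice of window (the Olympic season and the three seasons before it) are all assumptions, and the rows are made up.

```python
import pandas as pd

# Toy history of non-Olympic events; names and values are made up.
history = pd.DataFrame({
    "skater_name": ["Skater A"] * 4 + ["Skater B"] * 2,
    "season": [2013, 2015, 2017, 2018, 2016, 2017],
    "event_score": [170.0, 190.0, 200.0, 210.0, 180.0, 186.0],
    "falls": [1, 0, 1, 0, 2, 0],  # hypothetical error-count column
})


def average_prior_results(history, skater_name, olympic_season, window=4):
    """Average each numeric column over the `window` seasons ending at
    `olympic_season` (assumed window: olympic_season - window + 1 onward)."""
    in_window = (
        (history["skater_name"] == skater_name)
        & (history["season"] > olympic_season - window)
        & (history["season"] <= olympic_season)
    )
    numeric = history.loc[in_window].drop(columns=["skater_name", "season"])
    return numeric.mean()


# Skater A's 2013 result falls outside the window for a 2018 Olympics.
avg_a = average_prior_results(history, "Skater A", olympic_season=2018)
print(avg_a)
```

Each Olympic record would then carry one such set of averages as its features, with the Olympic score itself kept only as the target.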