In [5]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os

**If you lost points on the last checkpoint you can get them back by responding to TA/IA feedback**  

Update/change the relevant sections where you lost those points, make sure you respond on GitHub Issues to your TA/IA to call their attention to the changes you made here.

Please update your Timeline... no battle plan survives contact with the enemy, so make sure we understand how your plans have changed.

# COGS 108 - Data Checkpoint

# Names

- Levy Sahoo
- Keenan Serrao
- Leela Stuepfert
- Maya Ammar

# Research Question

Do socioeconomic factors such as GDP and population correlate with a nation’s medal success at the Olympic Games?


- Does a country’s GDP alone correlate with total medal count in the 2024 Olympics?
    - Are there any other significant factors that influence a nation’s medal success in the Olympics?
    - Do features such as country population influence the size of the Olympic team, and in turn the nation's likelihood of performing better?


## Background and Prior Work

The Olympic Games are widely regarded as the most prestigious sporting event in the world: a collective movement in which the best athletes assemble with one thing in mind, glory for their nation. The way to achieve this is to win a medal. Although it’s fair to say that the Olympic Games gather the best of the best on one platform to evaluate their performances in their respective events, we need to consider whether there is more to it than meets the eye. Instead of simply asking who will find a spot on the podium, looking at the journeys of athletes from various backgrounds shifts the question to how they arrived at the opportunity to compete for the podium, and whether there are quantitative features of their country of origin and its socioeconomic conditions that can help us answer the central question:

**Do socioeconomic factors such as GDP and population correlate with a nation’s medal success at the Olympic Games?**

**“What do other research studies say about factors correlated with a nation’s Olympic success?”**

A 2000 study<a name="cite_ref-1"></a>[<sup>1</sup>](#cite_ref-1) conducted by economists Andrew Bernard and Meghan Busse explored how economic factors such as GDP and population size impact a nation’s Olympic success. The report<a name="cite_ref-2"></a>[<sup>2</sup>](#cite_ref-2), summarized by Michael Klein, Professor of International Economic Affairs at Tufts University, identified real GDP as the most significant predictor of a country's medal count, estimating that a 10% increase in income per capita yields a 6.9% increase in medals, assuming population remains constant. Similarly, a 10% population increase, while holding income steady, leads to a 3.6% rise in medals. These findings suggest a clear correlation between economic scale and Olympic performance, providing a foundation for the relevance of these variables in predicting success. In contrast, some researchers dispute the validity of these regressors for a nation’s medal performance. A more recent 2022 study<a name="cite_ref-3"></a>[<sup>3</sup>](#cite_ref-3) questions the significance of GDP and population for Olympic outcomes, finding that variables such as GDP size, corruption ranking, athlete count, and topography don’t significantly impact medal standings. Instead, factors like inflation rates, economic activity, and income classification seem to offer alternative perspectives on predicting Olympic success. This work introduces a broader range of variables, which can help refine model regressors and clarify the limitations of traditional economic measures. Since some of these features aren’t publicly available, they can instead inform how we interpret the error terms of the linear regression models used in our analysis.
A separate Georgia Tech analysis<a name="cite_ref-4"></a>[<sup>4</sup>](#cite_ref-4) identifies other influential factors, such as country size and healthcare expenditure per capita, in determining Olympic performance. By highlighting healthcare investment as a predictor, it expands beyond purely economic indicators, suggesting that a nation’s healthcare infrastructure may play a vital role in supporting elite athletic development.

While each study offers valuable insights, they present varying perspectives on the significance of GDP and population as predictors of Olympic success, and none is definitive. Some researchers emphasize GDP and population as core predictors, while others cast doubt on their significance, proposing additional variables such as inflation and economic activity. Georgia Tech’s emphasis on healthcare investment introduces yet another dimension. This divergence highlights the complexity of determining Olympic success factors and suggests that GDP and population, though potentially impactful, may not be exhaustive predictors. By integrating these varied viewpoints, our project can assess the significance of socioeconomic regressors using commonly available data, and analyze their combined effect with other regressors. We will use econometric modeling to explore their potential in predicting Olympic medal ranking success, while considering possible model limitations and interpreting the sources of error.

1. <a name="cite_note-1"></a> [^](#cite_ref-1) Bernard, A. (1 Dec 2002) Who Wins the Olympic Games: Economic Resources and Medal Totals *Review of Economics and Statistics*. https://faculty.tuck.dartmouth.edu/images/uploads/faculty/andrew-bernard/olymp60restat_finaljournalversion.pdf 
2. <a name="cite_note-2"></a> [^](#cite_ref-2) Klein, M. (17 Jul 2024) What Determines Countries’ Olympic Success? *ECONOFACT* https://econofact.org/what-determines-countries-olympic-success
3. <a name="cite_note-3"></a> [^](#cite_ref-3) Sasha, W. (28 May 2022) Assessment of Olympic performance in relation to economic, demographic, geographic, and social factors: quantile and Tobit approaches *Taylor & Francis Online*. https://www.tandfonline.com/doi/full/10.1080/1331677X.2022.2080735#abstract
4. <a name="cite_note-4"></a> [^](#cite_ref-4) Boudreau, J. The Miracle on Thin Ice: How A Nation's GDP Affects its Olympic Performance *Georgia Tech University*. https://repository.gatech.edu/server/api/core/bitstreams/1aa2b537-c3de-4177-8295-3fcd3a03a965/content#:~:text=We%20estimate%20that%20GDP%20per,bronze%20medals%20a%20country%20receives


# Hypothesis


Null hypothesis (H<sub>0</sub>): There is no significant correlation between a country’s GDP per capita or population size and its Olympic medal count. That is, a higher GDP per capita or a larger population is not associated with the number of medals a country earns.


Alternative hypothesis (H<sub>a</sub>): There is a positive correlation between a country’s GDP per capita or population size and its Olympic medal count. This means that countries with a higher GDP per capita and larger population size are likely to win more medals, as increased population and resources may improve athlete development and access to training facilities. 


Reasoning: We aim to test the assumption that wealthier countries allocate more resources to athletic development, training, and facilities, potentially leading to greater Olympic success. Additionally, countries with a larger population have a greater pool of potential athletes, increasing the likelihood of finding more talent. Our hypothesis will allow us to determine whether these economic and demographic advantages contribute to Olympic success or if other factors might be just as influential.
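
As a rough illustration of how the hypothesis could be examined, the sketch below computes Pearson correlations between log-transformed predictors and log-transformed medal counts. The data frame `toy` and all of its values are made up for illustration; the column names are assumptions and do not come from the project datasets.

```python
import numpy as np
import pandas as pd

# Toy values for illustration only (not taken from the project datasets).
toy = pd.DataFrame({
    "gdp_per_capita": [1_000.0, 5_000.0, 20_000.0, 40_000.0, 80_000.0],
    "population_millions": [10.0, 50.0, 5.0, 120.0, 330.0],
    "total_medals": [1, 4, 3, 40, 120],
})

# Log-transform so a few very large countries do not dominate the estimate.
logged = np.log1p(toy)

# Pearson correlation of each predictor with the (log) medal count.
r_gdp = np.corrcoef(logged["gdp_per_capita"], logged["total_medals"])[0, 1]
r_pop = np.corrcoef(logged["population_millions"], logged["total_medals"])[0, 1]

print(f"r(log GDP per capita, log medals) = {r_gdp:.2f}")
print(f"r(log population, log medals)     = {r_pop:.2f}")
```

A correlation coefficient alone does not establish significance; a formal test or a regression with both predictors would be the natural next step.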

## Dataset Explanations overview


#### Dataset #1:
  - Dataset Name: Summer Olympic Medals 1896 - 2020
  - Link to the dataset: https://www.kaggle.com/datasets/ramontanoeiro/summer-olympic-medals-1986-2020 
  - Number of observations: 1344
  - Number of variables: 8

"Summer Olympic Medals 1896–2020" is a comprehensive historical record of medals awarded in every Summer Olympic Games from 1896 to 2020, sourced from Kaggle. Each row provides data on a specific country’s performance in a given Olympic year.

Dataset columns include:

- `Year`
- `Host_country`
- `Host_city`
- `Country_Name`
- `Country_Code`

and medal counts of:

- `Gold`
- `Silver`
- `Bronze`

This dataset contains 1,344 observations spanning numerous Olympic editions, enabling us to analyze trends in Olympic performance over time. While this dataset does not include economic indicators, it will serve as a foundation for historical analysis of country-specific medal achievements, which can later be compared with economic data for a more in-depth exploration of trends.

#### Dataset #2:

  - Dataset Name: GDP by Country 1999 - 2022
  - Link to the dataset: https://www.kaggle.com/datasets/alejopaullier/-gdp-by-country-1999-2022
  - Number of observations: 180
  - Number of variables: 24

"GDP by Country 1999–2022" offers annual GDP data for countries worldwide, covering the years `1999` through `2022`, with GDP measured in billions of US dollars. Note that the loaded file appears to have no `2011` column. It is structured with one row per `Country` and one column per year, which makes it well suited to tracking economic growth and fluctuations over time. Each cell holds a country’s GDP for a specific year, allowing for both cross-country comparisons within a single year and longitudinal analysis of a single country over multiple years. When this data is merged with the Olympic medal datasets, it will allow us to investigate whether economic factors, such as GDP growth or decline, correlate with Olympic success across different time periods.
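
Because the GDP table has one column per year while the medal table has one row per country and year, merging them requires reshaping first. The sketch below shows one way this might be done with `melt` and a left join; both toy frames and their values are illustrative, and it assumes country names already match across the two tables.

```python
import pandas as pd

# Toy wide-format GDP table shaped like the dataset (values are illustrative).
gdp_wide = pd.DataFrame({
    "Country": ["Albania", "Algeria"],
    "2000": [3.7, 54.7],
    "2004": [7.5, 85.0],
})

# Toy medal table with one row per country and Olympic year.
medals_toy = pd.DataFrame({
    "Country_Name": ["Algeria", "Algeria"],
    "Year": [2000, 2004],
    "Total_Medals": [3, 1],
})

# Reshape from one column per year to one row per (country, year).
gdp_long = gdp_wide.melt(id_vars="Country", var_name="Year", value_name="GDP")
gdp_long["Year"] = gdp_long["Year"].astype(int)
gdp_long = gdp_long.rename(columns={"Country": "Country_Name"})

# Left-join so every medal row is kept even when GDP is unavailable.
merged = medals_toy.merge(gdp_long, on=["Country_Name", "Year"], how="left")
print(merged)
```

A left join keeps every medal row, so countries without a GDP match show up as `NaN` rather than silently disappearing.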


#### Dataset #3: 
  - Dataset Name: 2024 Olympic Medals vs GDP
  - Link to the dataset: https://www.kaggle.com/datasets/ernestitus/2024-olympics-medals-vs-gdp
  - Number of observations: 90
  - Number of variables: 10

"2024 Olympics: Medals vs GDP," is sourced from Kaggle and modified from Mohamed Yosef’s “2024 Olympics Medals and Economic Status.” It provides data on the performance and economic indicators of countries participating in the 2024 Olympics. This dataset includes information for 90 countries, with each row representing a country and columns detailing attributes such as:

- `country` (name of the country)
- `country_code`
- `region` (e.g., Europe, Asia)

Olympic performance is recorded through medal counts:

- `gold`
- `silver`
- `bronze`
- `total`

Economic data fields include `gdp`, `gdp_year` (latest GDP data), and `population`, allowing for a thorough exploration of potential correlations between a `country`’s economic profile and its Olympic performance.

## Dataset #1: Summer Olympic Medals 1896 - 2020

#### Cleaning Process Explanation: 

For this dataset, we identified missing values using `medals.isnull().sum()` and found that the only gaps were 86 missing `Country_Code` entries, all from the 2016 Games. We first filled these by mapping each `Country_Name` to a code recorded for the same country in another year, which left 11 rows for countries with no usable code elsewhere in the data. Those were filled from a manually defined dictionary (e.g., `{'North Korea': 'PRK', 'Russia': 'RUS'}`). Because the dataset already stores `Gold`, `Silver`, and `Bronze` counts per country and Olympic year, no further aggregation was needed; we only added a `Total_Medals` column as their sum, which gives the total medals won by each country per Olympic year and is the quantity relevant to our research question. We also considered log-transforming medal counts if they display skewed distributions, since extreme values (from countries with high medal counts) could affect correlation results. Finally, to stay aligned with the GDP dataset, we filter `medals_cleaned` to rows where `medals_cleaned['Year'] >= 1999`.
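
To illustrate why a log transform was considered, the sketch below compares the skewness of a small made-up series of medal totals before and after `np.log1p`. The values are invented for illustration and are not drawn from the dataset.

```python
import numpy as np
import pandas as pd

# Toy medal totals: many small counts plus one very large one (illustrative only).
total_medals = pd.Series([1, 1, 2, 2, 3, 4, 5, 8, 12, 110], name="Total_Medals")

# log1p(x) = log(1 + x), so a count of zero stays defined and maps to 0.
log_medals = np.log1p(total_medals)

print("skew before:", round(total_medals.skew(), 2))
print("skew after: ", round(log_medals.skew(), 2))
```

The transform compresses the long right tail, so a single dominant country has less influence on a correlation estimate.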

Start by loading the dataset and taking a look at the first few rows as well as the column names and types.

In [6]:
medals = pd.read_csv('Summer_olympic_Medals.csv')
medals.head()

Unnamed: 0,Year,Host_country,Host_city,Country_Name,Country_Code,Gold,Silver,Bronze
0,1896,Greece,Athens,Great Britain,GBR,2,3,2
1,1896,Greece,Athens,Hungary,HUN,2,1,3
2,1896,Greece,Athens,France,FRA,5,4,2
3,1896,Greece,Athens,United States,USA,11,7,2
4,1896,Greece,Athens,Germany,GER,6,5,2


In [7]:
medals.dtypes

Year             int64
Host_country    object
Host_city       object
Country_Name    object
Country_Code    object
Gold             int64
Silver           int64
Bronze           int64
dtype: object

Next, look for null values in the dataset

In [8]:
# Count number of nans in each column
medals.isnull().sum()

Year             0
Host_country     0
Host_city        0
Country_Name     0
Country_Code    86
Gold             0
Silver           0
Bronze           0
dtype: int64

In [9]:
# Display rows with nans
medals[medals.isnull().any(axis=1)].head()

Unnamed: 0,Year,Host_country,Host_city,Country_Name,Country_Code,Gold,Silver,Bronze
1165,2016,Brazil,Rio de Janeiro,Denmark,,2,6,7
1166,2016,Brazil,Rio de Janeiro,Argentina,,3,1,0
1167,2016,Brazil,Rio de Janeiro,Sweden,,2,6,3
1168,2016,Brazil,Rio de Janeiro,Ukraine,,2,5,4
1169,2016,Brazil,Rio de Janeiro,South Africa,,2,6,2


As we can see below, we only have null country codes from the year 2016. This can be fixed by filling in the values with country codes from previous years.

In [10]:
medals[medals.isnull().any(axis=1)]['Year'].unique()

array([2016])

Fill in null country codes with values from previous years

In [11]:
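# Map each country name to a code. Later rows overwrite earlier ones in the dict,
# so a country whose most recent row lacks a code still maps to NaN here.
country_code_map = dict(zip(medals['Country_Name'], medals['Country_Code']))
# Apply the map; fillna falls back to the row's original code where the map has none.
medals_cleaned = medals.assign(Country_Code = medals['Country_Name'].map(country_code_map).fillna(medals['Country_Code']))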
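
A possible variation, shown here only as a sketch and not as what the cell above does, is to build the lookup from rows that already have a code and to fill only the missing entries. The frame `toy` and the names `known`, `code_lookup`, and `filled` are hypothetical, and the rows are invented for illustration.

```python
import pandas as pd

# Toy rows for illustration: one country with a known code, one without any.
toy = pd.DataFrame({
    "Year": [2012, 2016, 2012, 2016],
    "Country_Name": ["Denmark", "Denmark", "Country X", "Country X"],
    "Country_Code": ["DEN", None, None, None],
})

# Build the lookup only from rows that already have a code.
known = toy.dropna(subset=["Country_Code"]).drop_duplicates("Country_Name", keep="last")
code_lookup = known.set_index("Country_Name")["Country_Code"]

# Keep existing codes; fall back to the lookup only where the code is missing.
filled = toy["Country_Code"].fillna(toy["Country_Name"].map(code_lookup))
print(filled.tolist())
```

Countries with no code anywhere in the data still come out as missing and would need a manual mapping.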


Create a new column called `Total_Medals` that sums the total number of medals won by each country in each year.

In [44]:
medals_cleaned['Total_Medals'] = medals_cleaned['Gold'] + medals_cleaned['Silver'] + medals_cleaned['Bronze']

Recheck for null values

In [45]:
medals_cleaned.isnull().sum()

Year             0
Host_country     0
Host_city        0
Country_Name     0
Country_Code    11
Gold             0
Silver           0
Bronze           0
Total_Medals     0
dtype: int64

In [46]:
# Display country names for the first few rows that still have nans
medals_cleaned[medals_cleaned.isnull().any(axis=1)].head()['Country_Name'].unique()

array(['North Korea', 'United Arab Emirates', 'Russia', 'Niger',
       'Burundi'], dtype=object)

Manually fill in the null values for the country codes

In [47]:
# Manually define country codes for specific countries
manual_country_codes = {
    'North Korea': 'PRK',
    'United Arab Emirates': 'UAE',
    'Russia': 'RUS',
    'Niger': 'NIG',
    'Burundi': 'BDI',
    'Trinidad and Tobago': 'TTO',
    'Vietnam': 'VIE',
    'Independent Olympic Athletes': 'IOA',
    'Tajikistan': 'TJK',
    'Algeria': 'ALG',
    'Singapore': 'SGP'
}


# Apply these manual country codes to the data
medals_cleaned['Country_Code'] = medals_cleaned.apply(
    lambda row: manual_country_codes.get(row['Country_Name'], row['Country_Code']), axis=1
)

In [48]:
medals_cleaned.isnull().sum()

Year            0
Host_country    0
Host_city       0
Country_Name    0
Country_Code    0
Gold            0
Silver          0
Bronze          0
Total_Medals    0
dtype: int64

Our data is now clean and ready to use

In [49]:
medals_cleaned.head()

Unnamed: 0,Year,Host_country,Host_city,Country_Name,Country_Code,Gold,Silver,Bronze,Total_Medals
0,1896,Greece,Athens,Great Britain,GBR,2,3,2,7
1,1896,Greece,Athens,Hungary,HUN,2,1,3,6
2,1896,Greece,Athens,France,FRA,5,4,2,11
3,1896,Greece,Athens,United States,USA,11,7,2,20
4,1896,Greece,Athens,Germany,GER,6,5,2,13


In [50]:
medals.shape

(1344, 8)

In [55]:
medals_cleaned = medals_cleaned[medals_cleaned['Year'] >= 1999]
medals_cleaned

Unnamed: 0,Year,Host_country,Host_city,Country_Name,Country_Code,Gold,Silver,Bronze,Total_Medals
838,2000,Australia,Sydney,Spain,ESP,3,3,5,11
839,2000,Australia,Sydney,Canada,CAN,3,3,8,14
840,2000,Australia,Sydney,Iran,IRI,3,0,1,4
841,2000,Australia,Sydney,Turkey,TUR,3,0,2,5
842,2000,Australia,Sydney,Belarus,BLR,3,3,11,17
...,...,...,...,...,...,...,...,...,...
1339,2020,Japan,Tokyo,Fiji,FIJ,1,0,1,2
1340,2020,Japan,Tokyo,Estonia,EST,1,0,1,2
1341,2020,Japan,Tokyo,Latvia,LAT,1,0,1,2
1342,2020,Japan,Tokyo,Bermuda,BER,1,0,0,1


## Dataset #2: GDP by Country 1999 - 2022

#### Data Cleaning Explanation:

For this dataset, we kept only the columns for Summer Olympic years (`2000`, `2004`, `2008`, `2012`, `2016`, `2020`) by selecting the year columns divisible by 4. This alignment ensures that the GDP data corresponds directly with the Olympic medal data. `hist_gdps.isna().all()` returns `False` for every column, so no column is entirely `NaN`; a stricter check for any individual missing value is `hist_gdps.isna().any()`. Several rows also report a GDP of `0` (for example, Afghanistan and Yemen from 2015 onward), which likely stands in for missing data and will need handling before analysis. One challenge for future merging is the formal wording in some `Country` names, such as 'Republic' and 'Democratic', which we strip out below. Later on, we will ensure consistency across datasets with a country-name mapping dictionary (e.g., `{'United States': 'USA', 'Russian Federation': 'Russia'}`) applied with `df['Country'].replace(mapping_dict)`, so the datasets can be merged without issues. Moreover, since GDP values span a large range, we considered a log transformation of the GDP values using `np.log1p`, which could reduce skewness and make the distribution more manageable for analysis.


In [29]:
hist_gdps = pd.read_csv('GDPs99-22.csv')
hist_gdps.isna().all()

Country    False
1999       False
2000       False
2001       False
2002       False
2003       False
2004       False
2005       False
2006       False
2007       False
2008       False
2009       False
2010       False
2012       False
2013       False
2014       False
2015       False
2016       False
2017       False
2018       False
2019       False
2020       False
2021       False
2022       False
dtype: bool
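
The difference between `.isna().all()` and `.isna().any()` matters for this check, so here is a small sketch on made-up data. It also shows one way zeros might be converted to `NaN`, under the assumption (not confirmed by the dataset's documentation) that a GDP of exactly 0 marks a missing value.

```python
import numpy as np
import pandas as pd

# Toy frame: one real NaN and one row of zero placeholders (illustrative values).
toy = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "2016": [12.2, np.nan, 0.0],
    "2020": [17.0, 219.2, 0.0],
})

# .all() is True only when every value in a column is NaN.
print(toy.isna().all().to_dict())
# .any() is True when at least one value in a column is NaN.
print(toy.isna().any().to_dict())

# Assumption: a GDP of exactly 0 marks missing data rather than a real value.
year_cols = ["2016", "2020"]
with_nans = toy.copy()
with_nans[year_cols] = with_nans[year_cols].replace(0, np.nan)
print(with_nans.isna().sum().to_dict())
```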

In [30]:
hist_gdps

Unnamed: 0,Country,1999,2000,2001,2002,2003,2004,2005,2006,2007,...,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022
0,"Afghanistan, Rep. of.",0,0,0,4.084,4.585,5.971,7.309,8.399,9.892,...,21.555,24.304,0,0,0,0,0,0,0,0
1,Albania,3.444,3.695,4.096,4.456,5.6,7.452,8.376,9.133,10.163,...,14.91,16.053,11.591,12.204,13.214,14.341,15.553,16.996,16.77,18.012
2,Algeria,48.845,54.749,55.181,57.053,68.013,85.016,102.38,114.322,116.158,...,190.432,203.449,175.077,181.71,192.256,202.179,210.906,219.16,163.812,168.195
3,Angola,6.153,9.135,8.936,11.386,13.956,19.8,30.632,43.759,55.37,...,136.415,151.089,102.011,98.815,105.369,112.533,119.403,127.15,70.339,74.953
4,Antigua and Barbuda,0.652,0.678,0.71,0.718,0.754,0.818,0.875,0.962,1.026,...,1.404,1.494,1.285,1.328,1.386,1.458,1.536,1.617,1.405,1.534
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
175,Venezuela,97.977,117.153,122.872,92.889,83.442,112.8,143.443,181.608,219.372,...,403.123,409.562,131.855,133.534,144.227,155.096,170.41,184.364,44.893,43.546
176,Vietnam,28.684,31.196,32.504,35.148,39.63,45.548,53.053,60.995,68.298,...,135.729,148.914,198.805,214.75,229.845,247.415,265.987,287.257,368.002,415.493
177,"Yemen, Republic of",7.53,9.561,9.533,9.985,11.869,13.565,15.193,18.7,21.657,...,40.003,42.687,0,0,0,0,0,0,0,0
178,Zambia,3.132,3.238,3.64,3.775,4.326,5.44,7.271,10.942,10.104,...,21.829,23.613,24.466,25.158,27.17,29.911,32.957,36.316,21.699,23.967


In [34]:
index_hist_gdps = hist_gdps.set_index('Country')
hist_gdps = index_hist_gdps[[year for year in index_hist_gdps.columns if int(year) % 4 == 0]].reset_index()

In [35]:
remove_punctuation = r'[.,\'-]'
hist_gdps['Country'] = hist_gdps['Country'].str.replace(remove_punctuation, '', regex=True)

remove_words = r'\b(Rep|Republic|Demo|Democratic|of|the|People|Côte d|Equatorial|Islamic|Former)\b'
hist_gdps['Country'] = hist_gdps['Country'].str.replace(remove_words, '', regex=True, case=False)

hist_gdps['Country'] = hist_gdps['Country'].str.strip()
hist_gdps

Unnamed: 0,Country,2000,2004,2008,2012,2016,2020
0,Afghanistan,0,5.971,11.513,19.248,0,0
1,Albania,3.695,7.452,11.131,13.808,12.204,16.996
2,Algeria,54.749,85.016,126.889,177.83,181.71,219.16
3,Angola,9.135,19.8,67.608,118.426,98.815,127.15
4,Antigua and Barbuda,0.678,0.818,1.074,1.322,1.328,1.617
...,...,...,...,...,...,...,...
175,Venezuela,117.153,112.8,231.959,394.106,133.534,184.364
176,Vietnam,31.196,45.548,76.414,123.505,214.75,287.257
177,Yemen,9.561,13.565,24.504,37.153,0,0
178,Zambia,3.238,5.44,10.519,20.208,25.158,36.316
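
Stripping words from country names can make two different countries look identical, which would corrupt a later merge. The sketch below checks for such collisions on a few example names; whether these exact spellings occur in the GDP file is an assumption, and the word list is only a subset of the pattern used above.

```python
import pandas as pd

# Example names chosen to show a collision (spellings are assumed, not verified).
names = pd.Series([
    "Guinea",
    "Equatorial Guinea",
    "Congo, Republic of",
    "Congo, Democratic Republic of the",
])

cleaned = (
    names.str.replace(r"[.,'-]", "", regex=True)
    .str.replace(r"\b(Republic|Democratic|of|the|Equatorial)\b", "", regex=True, case=False)
    .str.replace(r"\s+", " ", regex=True)  # collapse leftover repeated spaces
    .str.strip()
)

# Names that now appear more than once can no longer be told apart in a merge.
collisions = cleaned[cleaned.duplicated(keep=False)]
print(pd.DataFrame({"original": names[collisions.index], "cleaned": collisions}))
```

Running a check like this on the full `Country` column before merging would flag any names that need a manual mapping instead of word removal.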


## Dataset #3: 2024 Olympic Medals vs GDP

#### Data Cleaning Explanation:

We will address this dataset later for merging purposes. For now, `olympics24.isna().all()` returns `False` for every column, so no column is entirely null (`olympics24.isna().any()` would be the check for individual missing values). The `gdp` column will need to be converted to the same units as the other datasets before merging. Finally, the GDP figures are dated 2023 or earlier in `gdp_year`, not 2024, so they would need to be updated to match the 2024 Games.

In [33]:
olympics24 = pd.read_csv('olympics.csv')
olympics24.isna().all()

country         False
country_code    False
region          False
gold            False
silver          False
bronze          False
total           False
gdp             False
gdp_year        False
population      False
dtype: bool

In [24]:
olympics24

Unnamed: 0,country,country_code,region,gold,silver,bronze,total,gdp,gdp_year,population
0,United States,USA,North America,40,44,42,126,81695.19,2023,334.9
1,China,CHN,Asia,40,27,24,91,12614.06,2023,1410.7
2,Japan,JPN,Asia,20,12,13,45,33834.39,2023,124.5
3,Australia,AUS,Oceania,18,19,16,53,64711.77,2023,26.6
4,France,FRA,Europe,16,26,22,64,44460.82,2023,68.2
...,...,...,...,...,...,...,...,...,...,...
85,Peru,PER,South America,0,0,1,1,7789.87,2023,34.4
86,Qatar,QAT,Asia,0,0,1,1,87480.42,2022,2.7
87,Singapore,SGP,Asia,0,0,1,1,84734.26,2023,5.9
88,Slovakia,SVK,Europe,0,0,1,1,24470.24,2023,5.4
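
One possible conversion for the `gdp` column is sketched below. It assumes that `gdp` is GDP per capita in US dollars and that `population` is in millions, which the magnitudes in the table suggest but the source does not state; the column name `gdp_total_billions` is hypothetical. The two rows are copied from the output above.

```python
import pandas as pd

# Two rows copied from the table above.
sample = pd.DataFrame({
    "country": ["United States", "China"],
    "gdp": [81695.19, 12614.06],      # assumed: GDP per capita in USD
    "population": [334.9, 1410.7],    # assumed: population in millions
})

# USD per person * millions of people = millions of USD; divide by 1000 for billions.
sample["gdp_total_billions"] = sample["gdp"] * sample["population"] / 1000

print(sample[["country", "gdp_total_billions"]].round(1))
```

If these unit assumptions hold, the result is in billions of US dollars, the same unit used by the historical GDP dataset.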


# Ethics & Privacy

With regard to privacy, terms of use, and security, the datasets being used are publicly available on Kaggle, and there are no direct privacy issues involving individuals. However, it is important that the data is used in accordance with Kaggle's terms of service, and that any findings are not misrepresented or used to make harmful generalizations about certain countries or groups.

There are potential biases and limitations in the datasets arising from the exclusion of certain nations. Countries that did not participate in the 2024 Olympics, or that have missing GDP figures, are excluded, which biases the analysis toward countries for which data is available. As a result, the data may be skewed against less economically developed nations, since their absence could lead us to underestimate the performance potential of nations with fewer economic resources. Moreover, analyzing whether GDP correlates with Olympic medal count implicitly assumes that economic power should or does lead to success in sports. This assumption can marginalize less wealthy nations, as GDP alone is not an adequate representation of a country's ability to win medals. Many other cultural and social factors play significant roles, which is why it is essential to acknowledge that athletic talent is universal, but opportunities to develop that talent are not. The framing of our findings must be approached with caution, emphasizing that the purpose is to identify patterns rather than to justify disparities between nations.

Our group will acknowledge the potential biases and limitations in the datasets when discussing our results, so that they are clear to readers. When communicating this analysis, we will be transparent that the correlations observed do not imply causation, and we will emphasize that our analysis is intended to explore patterns rather than make normative claims about countries’ athletic and economic capabilities. Additionally, to mitigate these biases we plan to include other GDP-related factors, such as government spending on sports and total investment, so as to reach a more nuanced conclusion about the determinants of Olympic success. Furthermore, throughout the data cleaning and modeling process, we will assess the distribution of countries by GDP and medals to identify any biases in representation, and we will evaluate model performance for potential overfitting to high-GDP countries. This matters for the reporting phase, where our general conclusions will be framed to avoid implying that economic power alone determines success. For instance, while we may observe a correlation between GDP and medal count, it is critical to stress that this does not mean only wealthy countries can succeed in the Olympics. This in turn bears on the ethical implications of the research question, specifically how our findings may reflect global inequalities in sports development and how they can raise awareness of the socioeconomic factors that contribute to Olympic success.

# Team Expectations 

- We all agree to meet virtually via Google Meet, Wednesdays at 4pm, to check in with one another and work together.
- We will make decisions through a unanimous vote. If a decision needs to be made in a short time frame and the other members are unresponsive, the individual is free to make a decision.
- Everyone will share equally in leading, communicating, programming, and research. Tasks will be assigned during our weekly meetings through discussion. We will track our progress on a timesheet via Google Docs/Sheets.
- We are committed to equally contributing to our project through discussing our roles, ideas and dividing up the work.
- When we are unsure, or have questions or thoughts we will remain in communication with one another via our group chat or individually. 

# Project Timeline Proposal

| Meeting Date | Meeting Time | Completed Before Meeting                                                               | Discuss at Meeting                                                                                      |
|--------------|--------------|----------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------|
| 10/30        | 5 PM         | Project Proposal Sections Complete                                                     | Determine best form of communication; Discuss and decide on final project topic; discuss hypothesis; begin background research |
| 11/6         | 5 PM         | Select Dataset(s), perform Exploratory Data Analysis. Have a solid grasp of Datasets. Commit initial deliverables to Github repository | Upload Datasets to a notebook, discuss methods to clean Datasets and discuss an analysis approach. Learn how to manage the group’s Github repository. |
| 11/13        | 5 PM         | Discuss at least 3 visualizations, and be able to explain the steps to making the charts with the data | Gather feedback from other group members, critique the relevance of charts to the objective questions and the null hypothesis. |
| 11/20        | 5 PM         | Build Models (if needed) to explain EDA. Finalize Analysis (EDA complete). (Rough Draft done at this point) | Understand what the entire analysis approach means and finalize EDA explanations. Plan for video recording. |
| 11/27        | 5 PM         | Finish up Privacy Conclusion. If possible, have the video done as well.                | Final review of Project Notebook, from Start to Finish. Ensure thoroughness and quality of the project with the group. |
| 11/30        | 5 PM         | Complete analysis; Final Draft results/conclusion/discussion                           | Discuss/edit full project. Review Video. Submit final project if desired.                               |
| 12/4         | Before 11:59 PM | Ensure everything is submitted and available (on github) and deliverable is finalized. (if necessary) | None. Relax.                                                                                             |