# Data Analysis in Base Python Guided Practice

Adapted from [Data Serialization Formats - Cumulative Lab](https://github.com/learn-co-curriculum/dsc-data-serialization-lab)

## Objectives
* Practice reading serialized JSON and CSV data from files into Python objects
* Practice extracting information from nested data structures
* Practice cleaning data (filtering, normalizing locations, converting types)
* Combine data from multiple sources into a single data structure
* Interpret descriptive statistics and data visualizations to present your findings

### Business Understanding

##### What is the relationship between the population of a country and their performance in the 2018 FIFA World Cup?

<p><a href="https://commons.wikimedia.org/wiki/File:World_cup_countries_best_results_and_hosts.PNG#/media/File:World_cup_countries_best_results_and_hosts.PNG"><img src="https://upload.wikimedia.org/wikipedia/commons/b/b7/World_cup_countries_best_results_and_hosts.PNG" alt="World cup countries best results and hosts.PNG" height="563" width="1280"></a><br><a href="http://creativecommons.org/licenses/by-sa/3.0/" title="Creative Commons Attribution-Share Alike 3.0">CC BY-SA 3.0</a>, <a href="https://commons.wikimedia.org/w/index.php?curid=578740">Link</a></p>

In this analysis, we are going to look at a sample of World Cup games in 2018 and the corresponding 2018 populations of the participating countries. 

### Data Understanding

We will be working with two separate data sources for this analysis.

#### `world_cup_2018.json`

* **Source**: This dataset comes from [`football.db`](http://openfootball.github.io/)
* **Contents**: Data about all games in the 2018 World Cup, including date, location (city and stadium), teams, goals scored (and by whom), and tournament group
* **Format**: Nested JSON data (dictionary containing a list of rounds, each of which contains a list of matches, each of which contains information about the teams involved and the goals scored)

#### `country_populations.csv`

* **Source**: This dataset comes from a curated collection by [DataHub.io](https://datahub.io/core/population), originally sourced from the World Bank
* **Contents**: Data about populations by country for all available years from 1960 to 2018
* **Format**: CSV data, where each row contains a country name, a year, and a population

### Steps for Analysis

#### 1. Get the data.

Read in both the `json` and `csv` files required for the analysis and explore their structure.

#### 2. List of teams in 2018 World Cup

Sorted list of all teams who competed in 2018 FIFA World Cup.

#### 3. Associating countries with 2018 World Cup performance

Create a data structure that connects a team name (country name) to its performance in the 2018 FIFA World Cup. Use total games won as the metric for performance.

#### 4. Associating countries with 2018 population

Add information on population for each country in 2018 to existing data structure.

#### 5. Analyze Population vs. Performance

Choose an appropriate statistical measure to analyze the relationship between population and performance. Create a visualization representing this relationship.

## Setting the Intention

When you are first handed a new dataset, the first thing you will do is perform a basic exploration. What data is available? What format is it in? How is it structured and what does that mean in terms of your analysis?

It's good practice to approach this exploration with some sort of intention. We have a general idea of the data we want to use. As we begin investigating our data sources try to keep this in mind and take notes about how we can access and work with this data.

## 1. Getting the Data

Our two data sources come in two different formats: `json` and `csv`. Below we read both `world_cup_2018.json` and `country_populations.csv` into our Jupyter notebook and explore their basic structure.

Start by importing Python's `json` and `csv` modules.

In [None]:
# Your code here (import json and csv)

Next, we will want to open the relevant files. Both files are saved inside a `data` folder within this directory so the file path is `data/<file_name>`. 

In the cells below, we open each file for you. 

Please use the `json` module to load the data from `world_cup_file` into a dictionary called `world_cup_data`.

In [None]:
# Replace None with the appropriate code
with open('data/world_cup_2018.json', encoding='utf8') as world_cup_file:
    world_cup_data = None

Please use the `csv` module to load the data from `population_file` into a list of dictionaries called `population_data`.

In [None]:
# Replace None with the appropriate code
with open('data/country_populations.csv') as population_file:
    population_data = None
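
As a rough illustration of the two loading patterns, the sketch below reads small in-memory stand-ins rather than the real files. The sample contents are invented for this example (including the Malta value); only `json.load` and `csv.DictReader` are the actual standard-library calls, and the column names follow the ones used later in this lab.

```python
import csv
import io
import json

# Hypothetical stand-ins for the real files (contents invented for illustration)
sample_json_file = io.StringIO('{"name": "Sample Cup", "rounds": []}')
sample_csv_file = io.StringIO(
    "Country Name,Year,Value\n"
    "Argentina,2018,44494502\n"
    "Malta,2017,467999\n"
)

# json.load parses an open file object into Python objects (here, a dict)
sample_json_data = json.load(sample_json_file)

# csv.DictReader yields one dict per row, keyed by the header row;
# wrapping it in list() reads every row while the file is still open
sample_csv_data = list(csv.DictReader(sample_csv_file))

print(type(sample_json_data), sample_json_data.keys())
print(sample_csv_data[0])
```

Note that every value produced by `csv.DictReader` is a string, which matters later when the population needs to be treated as a number.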

### Let's take a look at our data

An important part of being a data analyst is being able to work with data in various formats. 

We know `population_data` is a list from how we read the data from the file. Let's look at the elements in this list and how many there are.

In [None]:
# Check the first three elements in population_data


In [None]:
# Check how many elements are in population_data


Describe `population_data`:

Explore the structure of `world_cup_data`. Although JSON data loaded into Python usually ends up as nested dictionaries and lists, the structure can vary widely from file to file. The first step of working with JSON data should always be to get an idea of its structure (schema).

In [None]:
# Explore world_cup_data (Messy view)
print(None)

In [None]:
# Explore world_cup_data - Cleaner view
print(None)

Continue exploring `world_cup_data`

In [None]:
# Your code here (add as many cells as needed)

Describe `world_cup_data`:
- JSON structure with keys `name` and `rounds`
- `rounds` key is a list of the World Cup rounds. Contains the info on all matches
- `matches` is data we want - tells us winning/losing team
- 20 total rounds
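
One possible way to arrive at notes like these is to walk the structure one level at a time. The sketch below uses a tiny invented dictionary, so the `team1`/`team2` keys and all values are assumptions rather than the real contents of `world_cup_data`.

```python
import json

# Hypothetical miniature of a nested tournament structure (invented values)
sample_data = {
    "name": "Sample Cup",
    "rounds": [
        {"name": "Matchday 1", "matches": [{"team1": "A", "team2": "B"}]},
        {"name": "Final", "matches": [{"team1": "A", "team2": "C"}]},
    ],
}

# Top-level keys with the type and length of each value
for key, value in sample_data.items():
    print(key, type(value).__name__, len(value))

# Keys available in the first element of the nested list
first_round_keys = list(sample_data["rounds"][0].keys())
print(first_round_keys)

# json.dumps with indent gives a readable, indented view of one element
print(json.dumps(sample_data["rounds"][0], indent=2))
```

Printing a single element with `indent` tends to be easier to read than printing the whole structure at once.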

## 2. List of teams in 2018 FIFA World Cup

Create an alphabetically sorted list of teams who competed in the 2018 FIFA World Cup.

Take another look at the `world_cup_data` we just explored and outline a few steps we could follow to accomplish this.

#### Steps for Task 2:
- REPLACE WITH STEPS

In [None]:
# Your code here (add as many cells as needed)
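
A minimal sketch of one approach, assuming each match stores its two sides under `team1` and `team2` with a nested `name` field. Those key names and the sample matches are assumptions for illustration; check them against what you found while exploring.

```python
# Hypothetical miniature dataset; the "team1"/"team2" keys and nested
# "name" field are assumptions about the layout, not confirmed by the text
sample_data = {
    "rounds": [
        {"matches": [
            {"team1": {"name": "Uruguay"}, "team2": {"name": "Egypt"}},
            {"team1": {"name": "Russia"}, "team2": {"name": "Egypt"}},
        ]},
        {"matches": [
            {"team1": {"name": "Uruguay"}, "team2": {"name": "Russia"}},
        ]},
    ]
}

# A set keeps each team name once, however many matches it appears in
team_names = set()
for round_ in sample_data["rounds"]:
    for match in round_["matches"]:
        team_names.add(match["team1"]["name"])
        team_names.add(match["team2"]["name"])

# sorted() returns a new, alphabetically ordered list
sample_teams = sorted(team_names)
print(sample_teams)
```

Collecting into a set first avoids checking for duplicates by hand; the sort happens once at the end.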

## 3. Countries and their 2018 World Cup performance

> *Create a data structure* connecting a team name (country name) to its performance in 2018 FIFA World Cup. Use count of games won in entire tournament to represent performance.

> Create visualizations to help audience understand distribution of games won and performance of each team.

We want our data structure to connect a country name to the number of wins and eventually that country's population as well. For this exercise, we'll be building a **dictionary** where each key is the name of the country and each value is a nested dictionary containing information about that country's performance and population.

The final result will look something like this:
```
{
  'Argentina': { 'wins': 1, 'population': 44494502 },
  ...
  'Uruguay':   { 'wins': 4, 'population': 3449299  }
}
```

For the current step (step 3), we'll build a data structure that looks something like this:
```
{
  'Argentina': { 'wins': 1 },
  ...
  'Uruguay':   { 'wins': 4 }
}
```

Once again, let's outline a few steps we could take to accomplish this. 

#### Steps for Task 3:
- REPLACE WITH STEPS

In [None]:
# Your code here (add as many cells as needed)
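
The counting logic could be sketched as follows, using invented matches. The `score1`/`score2` key names are assumptions, and team names are plain strings here to keep the example short.

```python
# Hypothetical matches; the "score1"/"score2" keys are assumed names
sample_matches = [
    {"team1": "A", "team2": "B", "score1": 2, "score2": 0},
    {"team1": "B", "team2": "C", "score1": 1, "score2": 1},
    {"team1": "C", "team2": "A", "score1": 0, "score2": 3},
]

# Start every team at zero wins so teams that never win still appear
sample_combined = {}
for match in sample_matches:
    for team in (match["team1"], match["team2"]):
        sample_combined.setdefault(team, {"wins": 0})

# Credit a win to whichever side scored more; a draw credits nobody
for match in sample_matches:
    if match["score1"] > match["score2"]:
        sample_combined[match["team1"]]["wins"] += 1
    elif match["score2"] > match["score1"]:
        sample_combined[match["team2"]]["wins"] += 1

print(sample_combined)
```

This sketch ignores knockout games settled after extra time or penalties, which a real tournament dataset may need to handle separately.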

### Analysis of Wins

While we could try to understand all 32 of those numbers just by scanning through them, let's use some descriptive statistics and data visualizations instead!

#### Statistical Summary of Wins

Calculate the mean, median, and standard deviation of the number of wins.

In order to do this, you will need to create a list of the number of wins each country had in the World Cup.

In [None]:
# Your code here
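
As one option that stays within the standard library, the sketch below uses the `statistics` module on an invented dictionary shaped like the one built above. NumPy would work equally well; this is just an illustration.

```python
import statistics

# Invented values, shaped like the wins dictionary from this step
sample_combined = {
    "A": {"wins": 2},
    "B": {"wins": 0},
    "C": {"wins": 1},
    "D": {"wins": 1},
}

# Pull the wins out of each nested dictionary into a flat list
sample_wins = [record["wins"] for record in sample_combined.values()]

print("Mean:", statistics.mean(sample_wins))
print("Median:", statistics.median(sample_wins))
# pstdev treats the list as the whole population;
# statistics.stdev would treat it as a sample instead
print("Std dev:", statistics.pstdev(sample_wins))
```

`statistics.pstdev` matches the default behavior of NumPy's `np.std`, which also uses the population formula unless `ddof` is changed.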

#### Visualizations of Wins

In addition to those numbers, let's make a histogram (showing the distribution of the number of wins) and a bar graph (showing the number of wins by country).

In [None]:
# Your code here
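
Before plotting, it can help to see the counts a histogram would display. The sketch below tallies invented win totals with `collections.Counter` and prints a rough text version; it is a stand-in for intuition, not a replacement for the plotted histogram.

```python
from collections import Counter

# Invented win totals for seven hypothetical teams
sample_wins = [0, 1, 1, 2, 0, 1, 4]

# Count how many teams share each win total
win_frequencies = Counter(sample_wins)

# Print one row of '#' marks per win total, including empty bins
for n_wins in range(max(sample_wins) + 1):
    print(n_wins, "#" * win_frequencies[n_wins])
```

A `Counter` returns 0 for missing keys, so win totals that no team reached still show up as empty rows.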

#### Interpretation of Win Analysis

Before we move to looking at the relationship between wins and population, it's useful to understand the distribution of wins alone. A few notes of interpretation:
* **INTERPRETATION HERE**

## 4. Associating Countries with 2018 Population

> Add to the existing data structure so that it also connects each country name to its 2018 population, and create visualizations comparable to those from step 3.

Now we're ready to add the 2018 population to `combined_data`, finally using the CSV file!

Recall that `combined_data` currently looks something like this:
```
{
  'Argentina': { 'wins': 1 },
  ...
  'Uruguay':   { 'wins': 4 }
}
```

And the goal is for it to look something like this:
```
{
  'Argentina': { 'wins': 1, 'population': 44494502 },
  ...
  'Uruguay':   { 'wins': 4, 'population': 3449299  }
}
```

To do that, we need to extract the 2018 population information from the CSV data.

### Exploring the Structure of the Population Data CSV

Recall that previously we loaded information from a CSV containing population data into a list of dictionaries called `population_data`.

In [None]:
# Run this cell without changes
len(population_data)

12,695 is a very large number of rows to print out, so let's look at some samples instead.

In [None]:
# Run this cell without changes
import numpy as np  # has no effect if numpy was already imported above
np.random.seed(42)
population_record_samples = np.random.choice(population_data, size=10)
population_record_samples

There are **2 filtering tasks**, **1 data normalization task**, and **1 type conversion task** to be completed, based on what we can see in this sample. We'll walk through each of them below.

(In a more realistic data cleaning environment, you most likely won't happen to get a sample that demonstrates all of the data cleaning steps needed, but this sample was chosen carefully for example purposes.)

### Filtering Population Data

We already should have suspected that this dataset would require some filtering, since there are 32 records in our current `combined_data` dataset and 12,695 records in `population_data`. Now that we have looked at this sample, we can identify 2 features we'll want to use in order to filter down the `population_data` records to just 32. Try to identify them before looking at the answer below.

.

.

.

*Answer: the two features to filter on are* ***`'Country Name'`*** *and* ***`'Year'`***. *We can see from the sample above that there are countries in `population_data` that are not present in `combined_data` (e.g. Malta) and there are years present that are not 2018.*

In the cell below, create a new variable `population_data_filtered` that only includes relevant records from `population_data`. Relevant records are records where the country name is one of the countries in the `teams` list, and the year is "2018".

(It's okay to leave 2018 as a string since we are not performing any math operations on it, just make sure you check for `"2018"` and not `2018`.)

In [None]:
# Your code here
    
len(population_data_filtered)
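
A minimal sketch of a two-condition filter, run on a handful of records. The rows marked as placeholders carry invented values, and the `sample_` names are hypothetical so they do not clash with the lab's own variables.

```python
sample_teams = ["Argentina", "Uruguay"]

sample_population_data = [
    {"Country Name": "Argentina", "Year": "2018", "Value": "44494502"},
    {"Country Name": "Argentina", "Year": "1960", "Value": "111"},  # placeholder value
    {"Country Name": "Malta", "Year": "2018", "Value": "222"},      # placeholder value
    {"Country Name": "Uruguay", "Year": "2018", "Value": "3449299"},
]

# Keep a record only when BOTH conditions hold; note "2018" is a string
sample_filtered = [
    record
    for record in sample_population_data
    if record["Country Name"] in sample_teams and record["Year"] == "2018"
]

print(len(sample_filtered))
```

Comparing against the integer `2018` would silently match nothing, because every value read from the CSV is a string.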

Hmm...what went wrong? Why do we only have 27 records, and not 32?

Did we really get a dataset with 12k records that's missing 5 of the data points we need?

Let's take a closer look at the population data samples again, specifically the third one:

In [None]:
# Run this cell without changes
population_record_samples[2]

And compare that with the value for Iran in `teams`:

In [None]:
# Run this cell without changes
teams[13]

Ohhhh...we have a data normalization issue! One dataset refers to this country as `'Iran, Islamic Rep.'`, while the other refers to it as `'Iran'`. This is a common issue we face when using data about countries and regions, where there is no universally-accepted naming convention.

### Normalizing Locations in Population Data

Sometimes data normalization can be a very, very time-consuming task where you need to find "crosswalk" data that can link the two formats together, or you need to write advanced regex formulas to line everything up.

For this task, there are only 5 missing, so we'll just go ahead and give you a function that makes the appropriate substitutions.

In [None]:
def normalize_location(country_name):
    """
    Given a country name, return the name that the
    country uses when playing in the FIFA World Cup
    """
    name_sub_dict = {
        "Russian Federation": "Russia",
        "Egypt, Arab Rep.": "Egypt",
        "Iran, Islamic Rep.": "Iran",
        "Korea, Rep.": "South Korea",
        "United Kingdom": "England"
    }
    # The .get method returns the corresponding value from
    # the dict if present, otherwise returns country_name
    return name_sub_dict.get(country_name, country_name)

# Example where normalized location is different
print(normalize_location("Russian Federation"))
# Example where normalized location is the same
print(normalize_location("Argentina"))

Now, write new code to create `population_data_filtered` with normalized country names.

In [None]:
# Your code here
    
len(population_data_filtered)
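
One way to combine normalization with the filter is sketched below. It uses a reduced, hypothetical `normalize_name` helper (standing in for the function provided above) and placeholder population values, so treat it as an outline rather than the expected answer.

```python
def normalize_name(country_name):
    # Reduced, illustrative version of a substitution lookup
    substitutions = {"Iran, Islamic Rep.": "Iran"}
    return substitutions.get(country_name, country_name)

sample_teams = ["Argentina", "Iran"]
sample_population_data = [
    {"Country Name": "Iran, Islamic Rep.", "Year": "2018", "Value": "111"},  # placeholder value
    {"Country Name": "Argentina", "Year": "2018", "Value": "44494502"},
    {"Country Name": "Malta", "Year": "2018", "Value": "222"},  # placeholder value
]

sample_filtered = []
for record in sample_population_data:
    # Normalize BEFORE testing membership, otherwise renamed countries are dropped
    normalized = normalize_name(record["Country Name"])
    if normalized in sample_teams and record["Year"] == "2018":
        # Copy the record so the original list is left unmodified,
        # then store the normalized name for later lookups
        cleaned = dict(record)
        cleaned["Country Name"] = normalized
        sample_filtered.append(cleaned)

print([record["Country Name"] for record in sample_filtered])
```

Storing the normalized name in the filtered record means later steps can look countries up by the same name used in the team list.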

Great, now we should have 32 records instead of 27!

### Type Conversion of Population Data

We need to do one more thing before we'll have population data that is usable for analysis. Take a look at this record from `population_data_filtered` to see if you can spot it:

In [None]:
# Run this cell without changes
population_data_filtered[0]

Every value has the same data type (`str`), including the population value. In this example, it's `'44494502'`, when it needs to be `44494502` if we want to be able to compute statistics with it.

In the cell below, loop over `population_data_filtered` and convert the data type of the value associated with the `"Value"` key from a string to an integer, using the built-in `int()` function.

In [None]:
# Your code here

In [None]:
# Look at the last record to make sure the population
# value is an int
population_data_filtered[-1]

In [None]:
# Check that it worked
type(population_data_filtered[-1]['Value'])
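
The in-place conversion can be sketched on two records like this. The `sample_filtered` name is hypothetical; the population figures are the ones shown in the target structure above.

```python
sample_filtered = [
    {"Country Name": "Argentina", "Year": "2018", "Value": "44494502"},
    {"Country Name": "Uruguay", "Year": "2018", "Value": "3449299"},
]

# Reassign the value in place; int() raises ValueError on non-numeric text
for record in sample_filtered:
    record["Value"] = int(record["Value"])

print(type(sample_filtered[-1]["Value"]))
```

Because each `record` is the same dictionary object stored in the list, assigning to `record["Value"]` updates the list's contents directly.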

### Adding Population Data

Now it's time to add the population data to `combined_data`! Recall that the data structure currently looks like this:

In [None]:
# Run this cell without changes
combined_data

The goal is for it to be structured like this:
```
{
  'Argentina': { 'wins': 1, 'population': 44494502 },
  ...
  'Uruguay':   { 'wins': 4, 'population': 3449299  }
}
```

In the cell below, loop over `population_data_filtered` and add information about population to each country in `combined_data`:

In [None]:
# Your code here
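
A minimal sketch of the merge, assuming the filtered records already hold normalized names and integer populations. The `sample_` names are hypothetical; the values come from the target structure shown above.

```python
sample_combined = {"Argentina": {"wins": 1}, "Uruguay": {"wins": 4}}
sample_filtered = [
    {"Country Name": "Argentina", "Year": "2018", "Value": 44494502},
    {"Country Name": "Uruguay", "Year": "2018", "Value": 3449299},
]

# The country name is the shared key that links the two sources
for record in sample_filtered:
    country = record["Country Name"]
    sample_combined[country]["population"] = record["Value"]

print(sample_combined)
```

If a country name were missing from the dictionary, the lookup would raise a `KeyError`, which is a useful signal that normalization missed a case.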

### Analysis of Population

Let's perform the same analysis for population that we performed for count of wins.

Calculate the mean, median, and standard deviation of the population of countries.

In order to do this, you will need to create a list of the population for each country.

#### Statistical Analysis of Population

In [None]:
# Your code here

#### Visualizations of Population

Let's also make a histogram (showing the distribution of population) and a bar graph (showing population by country).

In [None]:
# Your code here

#### Interpretation of Population Analysis

* **INTERPRETATION HERE**

## 5. Analysis of Population vs. Performance

> Choose an appropriate statistical measure to analyze the relationship between population and performance, and create a visualization representing this relationship.

### Statistical Measure
So far we have learned about only two statistics for understanding the *relationship* between variables: **covariance** and **correlation**. We will use correlation here, because that provides a more standardized, interpretable metric.

In [None]:
# Run this cell without changes
np.corrcoef(wins, populations)[0][1]
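
To make the difference between covariance and correlation concrete, the sketch below computes both by hand on invented toy values (not the World Cup data). It uses the population formulas; the correlation comes out the same whether population or sample formulas are used, as long as they are used consistently.

```python
import math

# Invented toy values, not taken from the World Cup data
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)

# Covariance: average product of deviations (depends on the units of x and y)
covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / len(xs)

# Dividing by both standard deviations rescales the result to [-1, 1]
std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / len(xs))
std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / len(ys))
correlation = covariance / (std_x * std_y)

print(round(covariance, 3), round(correlation, 3))
```

Multiplying every `x` by 1,000 would multiply the covariance by 1,000 but leave the correlation unchanged, which is why correlation is easier to interpret when one variable is a population in the millions and the other is a handful of wins.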

Interpret correlation coefficient:
- **INTERPRETATION HERE**

### Data Visualization

A **scatter plot** is the most sensible form of data visualization for showing this relationship, because we have two dimensions of data, but there is no "increasing" variable (e.g. time) that would indicate we should use a line graph.

In [None]:
# Run this cell without changes
import matplotlib.pyplot as plt  # has no effect if already imported above

# Set up figure
fig, ax = plt.subplots(figsize=(8, 5))

# Basic scatter plot
ax.scatter(
    x=populations,
    y=wins,
    color="gray", alpha=0.5, s=100
)
ax.set_xlabel("2018 Population")
ax.set_ylabel("2018 World Cup Wins")
ax.set_title("Population vs. World Cup Wins")

# Add annotations for specific points of interest
highlighted_points = {
    "Belgium": 2, # Numbers are the index of that
    "Brazil": 3,  # country in populations & wins
    "France": 10,
    "Nigeria": 17
}
for country, index in highlighted_points.items():
    # Get x and y position of data point
    x = populations[index]
    y = wins[index]
    # Move each label slightly down and to the left of its point
    # (numbers were chosen by manually tweaking)
    xtext = x - (1.25e6 * len(country))
    ytext = y - 0.5
    # Annotate with relevant arguments
    ax.annotate(
        text=country,
        xy=(x, y),
        xytext=(xtext, ytext)
    )

### Data Visualization Interpretation

Interpret this plot in the cell below. Does this align with the findings from the statistical measure (correlation), as well as the map shown at the beginning of this lab (showing the best results by country)?

##### Visual Interpretation:

- **INTERPRETATION HERE**

### Final Analysis

> What is the relationship between the population of a country and their performance in the 2018 FIFA World Cup?

**ANSWER**

 