# Cleaning Data and Joining Datasets


## I. Cleaning Data

Cleaning data is often an important first step, because data is rarely clean and nicely formatted. Cleaning can range from changing datatypes to dealing with null values, including oddly entered null values such as the string "No Value Found".
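As a minimal sketch of both cleaning steps mentioned above, the example below replaces a placeholder string with a real null value and then converts the column to a numeric datatype. The column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data: numbers stored as strings, with a text placeholder
# standing in for a missing value.
raw = pd.DataFrame({
    "geocode": ["01", "02", "03"],
    "income": ["52000", "No Value Found", "61000"],
})

# Replace the placeholder string with a real null value.
cleaned = raw.replace("No Value Found", np.nan)

# Convert the column from strings to numbers; errors="coerce" turns
# anything that still cannot be parsed into NaN.
cleaned["income"] = pd.to_numeric(cleaned["income"], errors="coerce")

print(cleaned.dtypes)
print(cleaned["income"].isna().sum())  # number of missing incomes
```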

## II. Data Normalization

Very often we will need to normalize our data. For a meaningful analysis, we want values that can be compared fairly. Suppose, for example, we have income per geocode or per state. That is a poor comparison point, because some states have more people than others; to get meaning from this data we need to normalize it. Below, multiple methods of normalization are discussed, along with how to do them using the Python library `pandas`.

### Monetary Adjustments 

#### Adjusting for Inflation

When working with time-series data, that is, data collected over time, adjusting for inflation is important. One USD in 2009 does not have the same value as one USD in 2024.
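One common way to adjust for inflation is to rescale each value by a price index, expressing everything in the dollars of a chosen base year. The sketch below assumes that approach. The index values are placeholders, not real CPI figures, and the column names are hypothetical.

```python
import pandas as pd

# Hypothetical nominal amounts by year.
sales = pd.DataFrame({"year": [2009, 2024], "nominal_usd": [100.0, 100.0]})

# Placeholder price index values (NOT real CPI data); substitute a
# published index for actual analysis.
prices = pd.DataFrame({"year": [2009, 2024], "price_index": [100.0, 150.0]})

# Express every amount in the dollars of the base year.
base_year = 2024
base_index = prices.loc[prices["year"] == base_year, "price_index"].iloc[0]

adjusted = sales.merge(prices, on="year", how="left")
adjusted["real_usd"] = adjusted["nominal_usd"] * base_index / adjusted["price_index"]

print(adjusted)
```

With these placeholder numbers, 100 nominal dollars from 2009 become 150 base-year dollars, while the 2024 amount is unchanged.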

#### Standardizing to one Currency

When dealing with data from multiple countries, for example, we may need an extra step: standardizing all values to one currency. This is more complicated when the data spans multiple countries across multiple years. We will need to adjust for inflation for each individual country and then convert to one currency.
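A minimal sketch of the conversion step, assuming the amounts have already been adjusted for inflation within each country. The country codes, currency codes, and exchange rates are all hypothetical.

```python
import pandas as pd

# Hypothetical amounts in local currency, assumed to be already
# inflation-adjusted to a common base year within each country.
amounts = pd.DataFrame({
    "country": ["A", "B"],
    "currency": ["AAA", "BBB"],
    "real_local": [200.0, 300.0],
})

# Hypothetical exchange rates: USD per one unit of local currency.
rates = pd.DataFrame({
    "currency": ["AAA", "BBB"],
    "usd_per_unit": [0.5, 2.0],
})

# Attach the rate to each row, then convert to the common currency.
converted = amounts.merge(rates, on="currency", how="left")
converted["real_usd"] = converted["real_local"] * converted["usd_per_unit"]

print(converted[["country", "real_usd"]])
```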

### Adjusting for Population 

When working with data collected at a large scale, we often need to take population size into account. If we have COVID infection counts, we want to scale them by the population of the geographic region where those infections were recorded. To do this, we divide by the population:
$$ \frac{\text{covid infections}_i}{\text{total population}_i}$$
where $i$ is the location. For each location, the output is the number of COVID infections in that location divided by the number of people living there.
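In pandas, this per-location division is a single column operation. The sketch below uses made-up counts; the extra `per_100k` column is an optional rescaling added here only to make small rates easier to read.

```python
import pandas as pd

# Hypothetical counts: the same number of infections in two locations
# with very different populations.
regions = pd.DataFrame({
    "location": ["A", "B"],
    "covid_infections": [500, 500],
    "total_population": [10_000, 100_000],
})

# Infections divided by population, computed row by row (one row per location i).
regions["infection_rate"] = regions["covid_infections"] / regions["total_population"]

# Optional: express the rate per 100,000 people.
regions["per_100k"] = regions["infection_rate"] * 100_000

print(regions)
```

Although both locations report the same raw count, location A has ten times the infection rate of location B once population is taken into account.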

### Standardization

If we have a column that is not naturally related to another column (such as population) but we still want to compare its values on a common scale, we can use standardization.

The simplest definition of standardization is:
$$\phi(x) = \frac{x - \overline{x}}{\sigma}$$
where $\overline{x}$ is the column mean and $\sigma$ is the standard deviation of the column.
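The formula translates directly into pandas. In this sketch the data is made up, and `ddof=0` is a choice made here so that $\sigma$ is the population standard deviation; pandas' `std` defaults to the sample standard deviation (`ddof=1`), so pick whichever matches your analysis.

```python
import pandas as pd

# Hypothetical column to standardize.
df = pd.DataFrame({"income": [40_000.0, 50_000.0, 60_000.0]})

mean = df["income"].mean()
# ddof=0 gives the population standard deviation; pandas defaults to ddof=1.
std = df["income"].std(ddof=0)

# Subtract the column mean and divide by the standard deviation.
df["income_z"] = (df["income"] - mean) / std

print(df)
```

After standardization the column has mean 0 and standard deviation 1, so values are expressed as the number of standard deviations above or below the mean.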

## III. Joining Datasets


There are multiple ways to combine datasets; the most common is an operation called a join. There are 4 main types of joins to know: **left**, **right**, **inner**, and **outer**. Other variations exist, but these 4 cover most, if not all, dataset joining.

#### What is the difference between left, right, inner and outer joins?

The main difference between these joins is how they connect the data and how they deal with misaligned data, that is, data that appears in one dataset but not the other.

#### In all joins we need a join key (what column we are joining on)

Our join key is what we use to match rows in each dataset to each other. For example, if I have two datasets at the same geographic level that use the same geocode format, I could join them on the geocode to make a new dataset containing all the columns from both.

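In pandas, we use the following syntax, where `left_df` and `right_df` are the two datasets, `how` is one of `"left"`, `"right"`, `"inner"`, or `"outer"`, and `on` names the join key column:
``` pd.merge(left_df, right_df, how="left", on=join_key)```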

### Left and Right Joins 

Left and right joins are simple to think about. If we think of the first dataset as the left dataset and the second dataset as the right dataset, a left or right join simply indicates which dataset to keep intact.

For a left join, any record (or row) that exists in the left dataset will exist in the final dataset, regardless of whether the right dataset has data for that record's join key. Where there is no match, null values fill in the columns that come from the right dataset.

The right join is the opposite: all rows from the right dataset are kept in the joined dataset, and for any row in the right dataset without a match in the left dataset, the columns that come from the left dataset are filled with null values.

*Important Note:* If you use a left join, rows in the right dataset whose join key does not exist in the left dataset are dropped, and vice versa for a right join. If keeping all the data is important, an outer join is a better choice.
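The sketch below shows a left and a right join on two small, made-up datasets that share a `geocode` column. The DataFrame and column names are hypothetical.

```python
import pandas as pd

# Two hypothetical datasets; geocode "01" is only in left, "04" only in right.
left = pd.DataFrame({"geocode": ["01", "02", "03"], "income": [50, 60, 70]})
right = pd.DataFrame({"geocode": ["02", "03", "04"], "population": [200, 300, 400]})

# Left join: every row of `left` is kept; "01" gets a null population
# and "04" is dropped.
left_joined = pd.merge(left, right, how="left", on="geocode")

# Right join: every row of `right` is kept; "04" gets a null income
# and "01" is dropped.
right_joined = pd.merge(left, right, how="right", on="geocode")

print(left_joined)
print(right_joined)
```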

### Inner and Outer Joins

#### Inner Joins

An inner join keeps only the records whose join key appears in both datasets. Any record that is only in the right dataset, or only in the left, is dropped.
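A minimal inner join sketch on two made-up datasets (names and values are hypothetical):

```python
import pandas as pd

# Two hypothetical datasets; only geocodes "02" and "03" appear in both.
left = pd.DataFrame({"geocode": ["01", "02", "03"], "income": [50, 60, 70]})
right = pd.DataFrame({"geocode": ["02", "03", "04"], "population": [200, 300, 400]})

# Inner join: keep only rows whose geocode exists in both datasets.
inner_joined = pd.merge(left, right, how="inner", on="geocode")

print(inner_joined)
```

Because only matching rows survive, the result contains no null values introduced by the join.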


#### Outer Joins

Outer joins do the opposite: they combine the two datasets in their entirety, filling in any missing data with null values. All records from both the left and right datasets continue to exist in the joined dataset, with null values wherever there is no match.
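A minimal outer join sketch on two made-up datasets. The `indicator=True` argument is an optional extra used here to show where each row came from; the DataFrame and column names are hypothetical.

```python
import pandas as pd

# Two hypothetical datasets; geocode "01" is only in left, "04" only in right.
left = pd.DataFrame({"geocode": ["01", "02", "03"], "income": [50, 60, 70]})
right = pd.DataFrame({"geocode": ["02", "03", "04"], "population": [200, 300, 400]})

# Outer join: keep every row from both datasets, with nulls for mismatches.
# indicator=True adds a "_merge" column saying whether each row came from
# the left dataset only, the right dataset only, or both.
outer_joined = pd.merge(left, right, how="outer", on="geocode", indicator=True)

print(outer_joined)
```

The `_merge` column is a convenient way to check how many records were misaligned before deciding which join type to use.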