# DataCamp Personal Takeaways
## Cleaning Data With Python by Daniel Chen
***

## 1.- Exploring your data.
Use pandas to review the shape and current state of the data.

### Using dataframe methods and attributes.
```python
# Import libraries needed
import pandas as pd

# df is assumed to be a DataFrame that has already been loaded
df.head()
df.tail()
df.shape
df.columns
df.info()
# Dot notation works when the column name is a valid Python identifier
# and does not clash with an existing DataFrame attribute or method.
df.columnname.value_counts()
# See summary statistics for the column.
df.columnname.describe()

df.columnname.plot(kind="hist")
df.boxplot(column="columnname1", by="columnname2")
df.plot(kind="scatter", x="columnname1", y="columnname2")
```
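
The methods above can be tried without a real dataset. The following is a minimal sketch on a made-up toy DataFrame (the `city` and `temp` columns are hypothetical, not from the course):

```python
import pandas as pd

# Made-up toy data, only to have something to explore
df = pd.DataFrame({
    "city": ["Lima", "Quito", "Lima", "Bogota"],
    "temp": [18.5, 14.0, 19.5, 13.0],
})

n_rows, n_cols = df.shape                # (rows, columns)
city_counts = df["city"].value_counts()  # frequency of each city
temp_summary = df["temp"].describe()     # count, mean, std, min, quartiles, max

print(n_rows, n_cols)
print(city_counts)
print(temp_summary)
```

Bracket notation (`df["city"]`) is used here because it also works for column names with spaces or names that clash with DataFrame methods.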

## 2.- Tidying data for analysis.
Use Hadley Wickham's tidy data concept: 
1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.

**Remember!**

Tidy data is shaped for data cleaning and analysis, not for data viz!

### Using melt.
```python
# Melt airquality: airquality_melt
airquality_melt = pd.melt(airquality, id_vars='Date', var_name="measurement", value_name="reading")
```

### Using pivot.
```python
# Pivot airquality_melt: airquality_pivot
airquality_pivot = airquality_melt.pivot(index="Date", columns="measurement", values="reading")
```

### Reset index after pivot.
```python
# Reset the index of airquality_pivot: airquality_pivot_reset
airquality_pivot_reset = airquality_pivot.reset_index()
```
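
To see how melt, pivot, and reset_index fit together, here is a small round-trip sketch. The `wide` DataFrame and its values are made up for illustration; only the column names mirror the airquality example.

```python
import pandas as pd

# Hypothetical wide-format data: one column per measurement
wide = pd.DataFrame({
    "Date": ["2020-05-01", "2020-05-02"],
    "Ozone": [41.0, 36.0],
    "Wind": [7.4, 8.0],
})

# Wide -> long: every (Date, measurement) pair becomes its own row
long = pd.melt(wide, id_vars="Date", var_name="measurement", value_name="reading")

# Long -> wide again, then turn the Date index back into a regular column
back = long.pivot(index="Date", columns="measurement", values="reading").reset_index()
back.columns.name = None  # drop the leftover "measurement" label on the columns

print(long)
print(back)
```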

### Pivot duplicate values.
```python
# Pivot table the airquality_dup: airquality_pivot
# The string "mean" avoids needing a NumPy import for np.mean
airquality_pivot = airquality_dup.pivot_table(index="Date", columns="measurement", values="reading", aggfunc="mean")

# Reset the index of airquality_pivot
airquality_pivot = airquality_pivot.reset_index()
```
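
The reason `pivot_table` is needed here can be shown on a tiny example: `pivot` raises an error when an index/column pair appears more than once, while `pivot_table` aggregates the duplicates. The data below is made up for illustration.

```python
import pandas as pd

# Hypothetical long data with a duplicated (Date, measurement) pair
dup = pd.DataFrame({
    "Date": ["2020-05-01", "2020-05-01", "2020-05-01"],
    "measurement": ["Ozone", "Ozone", "Wind"],
    "reading": [40.0, 42.0, 7.4],
})

# pivot cannot decide which Ozone reading to keep, so it raises ValueError
try:
    dup.pivot(index="Date", columns="measurement", values="reading")
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table resolves the duplicates by aggregating them (here: the mean)
table = dup.pivot_table(
    index="Date", columns="measurement", values="reading", aggfunc="mean"
).reset_index()

print(pivot_failed)
print(table)
```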

### Creating new columns with string slicing.
```python
# Create the 'gender' column
tb_melt['gender'] = tb_melt.variable.str[0]

# Create the 'age_group' column
tb_melt['age_group'] = tb_melt.variable.str[1:]
```

### Creating new columns with string splitting.
```python
# Create the 'str_split' column
ebola_melt['str_split'] = ebola_melt["type_country"].str.split("_")

# Create the 'type' column
ebola_melt['type'] = ebola_melt["str_split"].str.get(0)

# Create the 'country' column
ebola_melt['country'] = ebola_melt["str_split"].str.get(1)
```
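
As an alternative to the three steps above, `str.split` can return the pieces as separate columns with `expand=True`. This is only a sketch on made-up values; `ebola_like` is a hypothetical stand-in for the melted ebola data.

```python
import pandas as pd

# Hypothetical stand-in for the melted data
ebola_like = pd.DataFrame({"type_country": ["Cases_Guinea", "Deaths_Liberia"]})

# n=1 splits only on the first underscore; expand=True returns one column per piece
ebola_like[["type", "country"]] = ebola_like["type_country"].str.split(
    "_", n=1, expand=True
)

print(ebola_like)
```

This skips the intermediate `str_split` column, at the cost of being a little less explicit about each step.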

## 3.- Combining data for analysis.
So what if you have multiple datasets?

### Using concatenation row-wise or column-wise.
```python
# Concatenate uber1, uber2, and uber3: row_concat
row_concat = pd.concat([uber1, uber2, uber3])

# Concatenate ebola_melt and status_country column-wise: ebola_tidy
ebola_tidy = pd.concat([ebola_melt, status_country], axis=1)
```
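
One thing to watch with row-wise concatenation is that each DataFrame keeps its own index, so labels repeat. A minimal sketch with two made-up frames shows the difference `ignore_index=True` makes:

```python
import pandas as pd

# Two hypothetical pieces of the same dataset
a = pd.DataFrame({"x": [1, 2]})
b = pd.DataFrame({"x": [3, 4]})

stacked = pd.concat([a, b])                          # original index labels are kept
renumbered = pd.concat([a, b], ignore_index=True)    # fresh 0..n-1 index

print(stacked.index.tolist())
print(renumbered.index.tolist())
```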

### Using globbing to find files on disk to concatenate.
```python
"""
Wildcards: * and ?
    Any csv file: *.csv
    Any single character: file_?.csv
"""
# Import necessary modules
import glob
import pandas as pd

# Write the pattern: pattern
pattern = '*.csv'

# Save all file matches: csv_files
csv_files = glob.glob(pattern)

# Create an empty list: frames
frames = []

#  Iterate over csv_files
for csv in csv_files:

    #  Read csv into a DataFrame: df
    df = pd.read_csv(csv)
    
    # Append df to frames
    frames.append(df)

# Concatenate frames into a single DataFrame: uber
uber = pd.concat(frames)
```
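
The same pattern can be tested end to end without any existing files. This sketch writes two made-up CSV files into a temporary directory (the `part_?.csv` names and the `ride` column are hypothetical) and then globs and concatenates them:

```python
import glob
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create two small CSV files to have something to find
    for i in (1, 2):
        path = os.path.join(tmp_dir, f"part_{i}.csv")
        pd.DataFrame({"ride": [i]}).to_csv(path, index=False)

    # '?' matches exactly one character; sorted() makes the order predictable
    csv_files = sorted(glob.glob(os.path.join(tmp_dir, "part_?.csv")))

    # Read every match and stack the results with a fresh index
    combined = pd.concat([pd.read_csv(f) for f in csv_files], ignore_index=True)

print(combined)
```

`glob.glob` does not guarantee any ordering, which is why the matches are sorted before reading.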

### Merging data.
```python
# Merge the DataFrames: o2o
o2o = pd.merge(left=site, right=visited, left_on="name", right_on="site")

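# Merge the DataFrames: m2o
# Same call as o2o: whether the merge is one-to-one or many-to-one
# depends on whether the key values repeat in the data, not on the syntax
m2o = pd.merge(left=site, right=visited, left_on="name", right_on="site")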

# Merge survey, visited and site to a single data frame
# of surveys with visit and site details

test = pd.merge(left=survey, right=visited, left_on="taken", right_on="ident")

m2m = pd.merge(left=test, right=site, left_on="site", right_on="name")
```
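
Because the merge syntax is the same for every relationship type, it can help to state the expected relationship and let pandas check it. The sketch below uses the `validate` parameter of `pd.merge`; the `site` and `visited` values are made up for illustration.

```python
import pandas as pd

# Hypothetical data: each site appears once, but can be visited several times
site = pd.DataFrame({"name": ["DR-1", "DR-3"], "lat": [-49.85, -47.15]})
visited = pd.DataFrame({"ident": [619, 622, 734], "site": ["DR-1", "DR-1", "DR-3"]})

# Passes: keys are unique on the left and may repeat on the right
merged = pd.merge(site, visited, left_on="name", right_on="site",
                  validate="one_to_many")

# Fails: "DR-1" repeats on the right, so this is not one-to-one
try:
    pd.merge(site, visited, left_on="name", right_on="site",
             validate="one_to_one")
    one_to_one_ok = True
except pd.errors.MergeError:
    one_to_one_ok = False

print(merged)
print(one_to_one_ok)
```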


## 4.- Cleaning data for analysis.
Changing data types of dirty data.

### Using categorical data type.
```python
# Convert the sex column to type 'category'
tips.sex = tips.sex.astype("category")

# Convert the smoker column to type 'category'
tips.smoker = tips.smoker.astype("category")
```

### Using numeric data.
```python
# Convert 'total_bill' to a numeric dtype
tips['total_bill'] = pd.to_numeric(tips['total_bill'], errors='coerce')

# Convert 'tip' to a numeric dtype
tips['tip'] = pd.to_numeric(tips["tip"], errors="coerce")
```
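
What `errors='coerce'` does is easiest to see on a small made-up Series: values that cannot be parsed become `NaN` instead of raising an error.

```python
import pandas as pd

# Hypothetical dirty column: numbers stored as text, plus one bad entry
raw = pd.Series(["16.99", "missing", "10.34"])

clean = pd.to_numeric(raw, errors="coerce")

print(clean)
print(clean.isna().sum())  # number of values that could not be converted
```

Counting the resulting `NaN` values is a quick way to check how much data the coercion silently discarded.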

### Using regular expressions.
```python
# Import the regular expression module
import re

# Compile the pattern: prog
prog = re.compile(r'\d{3}-\d{3}-\d{4}')

# See if the pattern matches
result = prog.match('123-456-7890')
print(bool(result))

# See if the pattern matches
result2 = prog.match('1123-456-7890')
print(bool(result2))

# Find the numeric values: matches
matches = re.findall(r'\d+', 'the recipe calls for 10 strawberries and 1 banana')

# Print the matches
print(matches)

# Write the first pattern
pattern1 = bool(re.match(pattern=r'\d{3}-\d{3}-\d{4}', string='123-456-7890'))
print(pattern1)

# Write the second pattern
pattern2 = bool(re.match(pattern=r'\$\d*\.\d{2}', string='$123.45'))
print(pattern2)

# Write the third pattern
pattern3 = bool(re.match(pattern=r'[A-Z]\w*', string='Australia'))
print(pattern3)
```
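
One detail worth keeping in mind: `re.match` only anchors the pattern at the start of the string, so trailing characters are ignored. A small sketch comparing it with `re.fullmatch` on a made-up phone number with one digit too many:

```python
import re

phone = re.compile(r"\d{3}-\d{3}-\d{4}")

too_long = "123-456-78901"  # hypothetical input with an extra trailing digit

prefix_ok = bool(phone.match(too_long))      # only checks from the start
whole_ok = bool(phone.fullmatch(too_long))   # the entire string must match

print(prefix_ok, whole_ok)
```

When validating a whole value, `fullmatch` (or a pattern ending in `$`) is usually the safer choice.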

### Using functions to clean data.
```python
# np.nan below requires NumPy
import numpy as np

# Define recode_gender()
def recode_gender(gender):

    # Return 0 if gender is 'Female'
    if gender == "Female":
        return 0
    
    # Return 1 if gender is 'Male'    
    elif gender == "Male":
        return 1
    
    # Return np.nan    
    else:
        return np.nan

# Apply the function to the sex column
tips['recode'] = tips["sex"].apply(recode_gender)
```
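
For a simple value-to-value recode like this one, `Series.map` with a dictionary is a possible alternative to writing a function. This is a sketch on a made-up Series; values missing from the dictionary become `NaN`, which matches the `else` branch above.

```python
import pandas as pd

# Hypothetical column with one unexpected value
sex = pd.Series(["Female", "Male", "Unknown"])

# Keys not present in the mapping are turned into NaN
recode = sex.map({"Female": 0, "Male": 1})

print(recode)
```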

### Using lambda functions.
```python
# Write the lambda function using replace
tips['total_dollar_replace'] = tips.total_dollar.apply(lambda x: x.replace('$', ''))

# Write the lambda function using regular expressions
tips['total_dollar_re'] = tips.total_dollar.apply(lambda x: re.findall(r'\d+\.\d+', x)[0])
```
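
The same cleanup can also be done without `apply`, using the vectorized string methods. A minimal sketch, assuming a made-up `total_dollar` Series of dollar strings:

```python
import pandas as pd

# Hypothetical dollar amounts stored as text
total_dollar = pd.Series(["$16.99", "$10.34"])

# regex=False treats "$" as a literal character, not as the end-of-string anchor
stripped = total_dollar.str.replace("$", "", regex=False)

# Convert the cleaned strings to numbers
as_number = pd.to_numeric(stripped)

print(as_number)
```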

### Dealing with duplicate and missing data.

```python
# Drop the duplicates: tracks_no_duplicates
tracks_no_duplicates = tracks.drop_duplicates()

# Calculate the mean of the Ozone column: oz_mean
oz_mean = airquality.Ozone.mean()

# Replace all the missing values in the Ozone column with the mean
airquality['Ozone'] = airquality.Ozone.fillna(oz_mean)
```

### Assert statements.
An `assert` statement raises an `AssertionError` if its condition is false.
```python
# Assert that there are no missing values
assert ebola.notnull().all().all()

# Assert that all values are >= 0
assert (ebola >= 0).all().all()
```
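
Filling missing values and asserting the result go well together. The sketch below uses a made-up `ozone` Series: it counts the gaps, fills them with the mean, and then asserts that none remain.

```python
import numpy as np
import pandas as pd

# Hypothetical readings with two missing values
ozone = pd.Series([41.0, np.nan, 12.0, np.nan])

n_missing = int(ozone.isna().sum())   # how many values are missing
filled = ozone.fillna(ozone.mean())   # mean() skips NaN by default

# The optional message makes a failure easier to diagnose
assert filled.notnull().all(), "Ozone still has missing values"

print(n_missing)
print(filled.tolist())
```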

## 5.- Case study.
Do it!

```python
def check_null_or_valid(row_data):
    """Function that takes a row of data,
    drops all missing values,
    and checks if all remaining values are greater than or equal to 0
    """
    no_na = row_data.dropna()
    numeric = pd.to_numeric(no_na)
    ge0 = numeric >= 0
    return ge0

# Check whether the first column is 'Life expectancy'
assert g1800s.columns[0] == "Life expectancy"

# Check whether the values in the row are valid
assert g1800s.iloc[:, 1:].apply(check_null_or_valid, axis=1).all().all()

# Check that there is only one instance of each country
# value_counts() sorts descending, so the first count is the largest
assert g1800s['Life expectancy'].value_counts().iloc[0] == 1
```

```python
# Convert the year column to numeric
gapminder.year = pd.to_numeric(gapminder.year)

# Test if country is of type object
# (np.object was removed in NumPy 1.24; use the builtin object)
assert gapminder.country.dtypes == object

# Test if year is of type int64
assert gapminder.year.dtypes == np.int64

# Test if life_expectancy is of type float64
assert gapminder.life_expectancy.dtypes == np.float64
```
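
Another way to write these dtype checks is with the helper functions in `pandas.api.types`, which accept any integer or float width. This is a sketch on a made-up two-row frame named `gap`, not the real gapminder data.

```python
import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype

# Hypothetical data: year arrives as text, life_expectancy as floats
gap = pd.DataFrame({"year": ["1800", "1801"], "life_expectancy": [28.2, 28.3]})

# Convert the text years to numbers
gap["year"] = pd.to_numeric(gap["year"])

year_is_int = is_integer_dtype(gap["year"])
life_is_float = is_float_dtype(gap["life_expectancy"])

assert year_is_int
assert life_is_float
```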

```python
# Find invalid_countries: country names that contain anything
# other than letters, periods, and spaces

countries = gapminder.country.drop_duplicates()

pattern = r'^[A-Za-z\s\.]*$'

invalid_countries = countries[~countries.str.contains(pattern)]
```
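
To check what the pattern accepts and rejects, here is a small sketch on made-up country names (these are illustrative values, not necessarily ones present in the gapminder data):

```python
import pandas as pd

# Hypothetical names: two valid, two with characters outside the allowed set
countries = pd.Series(["Chile", "St. Lucia", "Congo, Dem. Rep.", "Åland"])

# Only ASCII letters, whitespace, and periods from start to end
pattern = r"^[A-Za-z\s.]*$"

# ~ inverts the boolean mask, keeping the names that do NOT match
invalid = countries[~countries.str.contains(pattern)]

print(invalid.tolist())
```

The comma and the non-ASCII letter are what make the last two names fail, which shows that the pattern may flag legitimate names too.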

```python
# Assert that country does not contain any missing values
assert pd.notnull(gapminder.country).all()

# Assert that year does not contain any missing values
assert pd.notnull(gapminder.year).all()

# Drop the missing values
gapminder = gapminder.dropna()
```

```python
# Import matplotlib for plotting
import matplotlib.pyplot as plt

# Add first subplot
plt.subplot(2, 1, 1)

# Create a histogram of life_expectancy
gapminder.life_expectancy.plot(kind="hist")

# Group gapminder: gapminder_agg
gapminder_agg = gapminder.groupby('year')['life_expectancy'].mean()

# Print the head of gapminder_agg
print(gapminder_agg.head())

# Print the tail of gapminder_agg
print(gapminder_agg.tail())

# Add second subplot
plt.subplot(2, 1, 2)

# Create a line plot of life expectancy per year
gapminder_agg.plot()

# Add title and specify axis labels
plt.title('Life expectancy over the years')
plt.ylabel('Life expectancy')
plt.xlabel('Year')

# Display the plots
plt.tight_layout()
plt.show()

# Save both DataFrames to csv files
gapminder.to_csv('gapminder.csv')
gapminder_agg.to_csv('gapminder_agg.csv')
```