# Cleaning Data
If you want to type along with me, use [this notebook](https://humboldt.cloudbank.2i2c.cloud/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fbethanyj0%2Fdata271_sp24&branch=main&urlpath=tree%2Fdata271_sp24%2Fdemos%2Fdata271_demo28_live.ipynb) instead. 
If you don't want to type and want to follow along just by executing the cells, stay in this notebook. 

In [None]:
import numpy as np
import pandas as pd

In [None]:
# import the data from last time
df = pd.read_csv('youtube_humboldt.csv')
df = df.iloc[:,1:]
df.head()

## Cleaning the data 

### Reformat column names

In [None]:
# OPTION 1: .rename method
df.rename(columns = {'videoId':'video_id',
                     'publishedAt':'published_at',
                     'viewCount':'view_count',
                     'likeCount':'like_count'})

In [None]:
# does not update original (assign the result back or use inplace=True)
df
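
To see why the original is unchanged, here is a minimal sketch on a tiny stand-in frame. The name `toy` and its values are made up for illustration and are not part of the YouTube data.

```python
import pandas as pd

# Hypothetical two-column stand-in for the real data
toy = pd.DataFrame({'videoId': ['a1', 'b2'], 'viewCount': ['10', '25']})
mapping = {'videoId': 'video_id', 'viewCount': 'view_count'}

# rename returns a new DataFrame and leaves the original alone
renamed = toy.rename(columns=mapping)
columns_before = list(toy.columns)   # still the camelCase names
print(columns_before)
print(list(renamed.columns))

# To keep the change, assign the result back (or pass inplace=True)
toy = toy.rename(columns=mapping)
print(list(toy.columns))
```

Assigning the result back is usually easier to read than `inplace=True`, but either keeps the new names.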

In [None]:
# OPTION 2: more general approach that would work for many columns
# Split column names by words
import re
split_by_words = [re.split('(?=[A-Z])',i) for i in df.columns]
split_by_words

In [None]:
# Insert an underscore between words
columns_with_underscores = ['_'.join(i) for i in split_by_words]
columns_with_underscores

In [None]:
# Reassign column names
df.columns = columns_with_underscores

In [None]:
# Make everything lowercase
df.columns = df.columns.str.lower()
df.head()
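
The split, join, and lowercase steps above can also be folded into one small helper. This is only a sketch: `to_snake_case` is a hypothetical name, and the column list is made up. It assumes camelCase names without runs of capitals (a name like `videoID` would come out as `video_i_d`).

```python
import re
import pandas as pd

def to_snake_case(name):
    """Insert '_' before each capital that is not the first character, then lowercase."""
    return re.sub(r'(?<!^)(?=[A-Z])', '_', name).lower()

# Empty frame with made-up column names, just to exercise the helper
toy = pd.DataFrame(columns=['videoId', 'publishedAt', 'viewCount', 'Title'])
toy.columns = [to_snake_case(c) for c in toy.columns]
print(list(toy.columns))
```

The `(?<!^)` part skips a capital at the very start of a name, so `Title` becomes `title` rather than `_title`.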

### Convert datatypes

In [None]:
# Suppose we wanted to look at the relationship between view counts and like counts
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
sns.scatterplot(data = df, x='view_count',y='like_count')
plt.show()

In [None]:
# Something is funky. Check the data types
df.dtypes

In [None]:
# convert the data types
df = ...
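
One possible way to do the conversion, sketched on made-up values: if the counts were read in as strings, `pd.to_numeric` turns them into numbers. The frame `toy` and the `'n/a'` entry are assumptions for illustration, and the cell above may well use a different approach (for example `astype`).

```python
import pandas as pd

# Made-up counts stored as strings, with one unparseable entry
toy = pd.DataFrame({'view_count': ['1200', '87', '15000'],
                    'like_count': ['40', '3', 'n/a']})

count_cols = ['view_count', 'like_count']
# errors='coerce' turns anything that is not a number into NaN instead of raising
toy[count_cols] = toy[count_cols].apply(pd.to_numeric, errors='coerce')
print(toy.dtypes)
```

A column that picks up a NaN becomes float, since the default integer dtype cannot hold missing values.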

In [None]:
# check data types again
df.dtypes

In [None]:
# plot again
sns.scatterplot(data = df, x='view_count',y='like_count')
plt.show()

In [None]:
# What if we wanted to plot the trend in view counts over time?
...
plt.show()

In [None]:
# change dates to datetime data
df = ...
df.head()
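
A minimal sketch of the date conversion, assuming the timestamps are ISO 8601 strings like the made-up ones below. The frame `toy` and the extra `year` column are illustrative only.

```python
import pandas as pd

# Made-up timestamps in ISO 8601 form
toy = pd.DataFrame({'published_at': ['2024-03-29T17:00:00Z', '2023-11-02T08:30:00Z'],
                    'view_count': [1200, 87]})

toy['published_at'] = pd.to_datetime(toy['published_at'])
print(toy.dtypes)

# Once the column is datetime, the .dt accessor exposes its parts
toy['year'] = toy['published_at'].dt.year
print(toy)
```

With real datetime values on the x-axis, a plot places points by time rather than treating each string as a separate label.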

In [None]:
# Check dtypes again
df.dtypes

In [None]:
# plot again

plt.show()

#### More on working with dates

In [None]:
date1 = 'March 29, 2024'


In [None]:
date2 = 'Mar 29, 2024'


In [None]:
date3 = '3/29/10'


In [None]:
date4 = '29-03-10'
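
The four strings above could be parsed along these lines. This is a sketch, and it assumes `date3` is month/day/year and `date4` is day-month-year, which the strings alone do not prove.

```python
import pandas as pd

date1 = 'March 29, 2024'
date2 = 'Mar 29, 2024'
date3 = '3/29/10'
date4 = '29-03-10'

# Spelled-out month names are unambiguous, so no format is needed
ts1 = pd.to_datetime(date1)
ts2 = pd.to_datetime(date2)

# All-numeric dates are ambiguous, so state the layout with format=
ts3 = pd.to_datetime(date3, format='%m/%d/%y')  # month/day/two-digit year
ts4 = pd.to_datetime(date4, format='%d-%m-%y')  # day-month-two-digit year

print(ts1, ts2, ts3, ts4)
```

Passing `format=` is the safer habit for numeric dates, because it fails loudly when a value does not match instead of guessing.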


### Reset the index

In [None]:
# Reset the index for easy access by video id

df.head()

In [None]:
# index by video id
video_id = "GpOplrOC7X0"
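
One way to get label-based access is to make the id column the index, sketched below. The second id, the titles, and the counts are made up, and the cells above may use a different call.

```python
import pandas as pd

toy = pd.DataFrame({'video_id': ['GpOplrOC7X0', 'xyz789'],  # second id is made up
                    'title': ['Video A', 'Video B'],         # made-up titles
                    'view_count': [1200, 87]})

# Move video_id out of the columns and into the index
toy = toy.set_index('video_id')

video_id = "GpOplrOC7X0"
row = toy.loc[video_id]  # look up one row by its label
print(row)

# reset_index() moves the index back into an ordinary column
restored = toy.reset_index()
print(restored.columns.tolist())
```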


### Reorder the data

In [None]:
# reorder rows


In [None]:
# doesn't update original
df

In [None]:
# reorder columns


In [None]:
# reorder columns manually
df[['view_count','like_count','date','published_at','title']]
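
A few common ways to reorder rows and columns, sketched on a made-up frame. Which of these the blank cells above are meant to use is an assumption.

```python
import pandas as pd

toy = pd.DataFrame({'title': ['C vid', 'A vid', 'B vid'],
                    'view_count': [1200, 87, 15000],
                    'like_count': [40, 3, 900]},
                   index=['c', 'a', 'b'])  # made-up labels and values

by_label = toy.sort_index()                 # rows ordered by index label
reversed_rows = toy.iloc[::-1]              # rows in reverse order
alphabetical_cols = toy.sort_index(axis=1)  # columns ordered by name
chosen_cols = toy.reindex(columns=['view_count', 'like_count', 'title'])

# None of these calls changed toy itself
print(toy.index.tolist(), toy.columns.tolist())
```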

In [None]:
# sort the data 


In [None]:
# doesn't update original
df

In [None]:
# do it inplace


In [None]:
df
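
The difference between sorting into a copy and sorting in place can be sketched as follows, using a made-up frame named `toy`.

```python
import pandas as pd

toy = pd.DataFrame({'title': ['A', 'B', 'C'],
                    'view_count': [87, 15000, 1200]})  # made-up values

# sort_values returns a sorted copy by default
sorted_copy = toy.sort_values(by='view_count', ascending=False)
order_after_copy = list(toy['view_count'])  # original order is preserved
print(order_after_copy)

# inplace=True modifies toy directly and returns None
toy.sort_values(by='view_count', ascending=False, inplace=True)
print(list(toy['view_count']))
```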

In [None]:
# sort by multiple values at once
df.sort_values(by = ['view_count','like_count'],ascending=[False,True], inplace=True)
df

### Reshaping data

In [None]:
df_weather_wide = pd.read_csv('sample_weather.csv')
df_weather_wide = df_weather_wide.iloc[:,1:]
df_weather_wide

In [None]:
# transpose the data


In [None]:
# transpose with more informative columns
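
A sketch of both transposes on a made-up weather table. The column names `city`, `mon`, and `tue` are assumptions and may not match `sample_weather.csv`.

```python
import pandas as pd

# Hypothetical wide-format weather data
toy = pd.DataFrame({'city': ['Arcata', 'Eureka'],
                    'mon': [55, 57],
                    'tue': [58, 60]})

# Plain transpose: the new column labels are just the old row numbers 0, 1
plain_T = toy.T
print(plain_T)

# Setting a meaningful index first gives the transposed frame informative columns
labeled_T = toy.set_index('city').T
print(labeled_T)
```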


In [None]:
# change wide format data into long format
long_df = ...
long_df
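
Going from wide to long is typically done with `melt`. Here is a sketch on a made-up table; the names `city`, `day`, and `temp` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical wide-format data: one column per day
wide = pd.DataFrame({'city': ['Arcata', 'Eureka'],
                     'mon': [55, 57],
                     'tue': [58, 60]})

# id_vars stay as they are; every other column is stacked into (day, temp) pairs
long = wide.melt(id_vars='city', var_name='day', value_name='temp')
print(long)
```

Each row of the long frame holds one observation: a city, a day, and the value recorded for that pair.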

In [None]:
# change long format back into wide format
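
The reverse direction can be sketched with `pivot`, again on made-up data with assumed column names.

```python
import pandas as pd

# Hypothetical long-format data: one row per (city, day) pair
long = pd.DataFrame({'city': ['Arcata', 'Eureka', 'Arcata', 'Eureka'],
                     'day': ['mon', 'mon', 'tue', 'tue'],
                     'temp': [55, 57, 58, 60]})

# index -> row labels, columns -> new column labels, values -> cell contents
wide = long.pivot(index='city', columns='day', values='temp')
print(wide)

# Optional tidy-up: turn the index back into a column and drop the 'day' axis label
wide_flat = wide.reset_index().rename_axis(columns=None)
print(wide_flat)
```

`pivot` needs each index/column pair to appear only once; with duplicates it raises an error.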


### What to do when there are multiple values in a category

In [None]:
long_df = pd.read_csv('long_data.csv')
long_df = long_df.iloc[:,1:]
long_df.head()

In [None]:
# Pivot the data to get average sales by date and category


In [None]:
# Pivot the data to get TOTAL sales by date and category
wide_df = ...
wide_df

In [None]:
# Pivot the data to get TOTAL sales by date, product, and category
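
When a date and category pair has several rows, `pivot_table` aggregates them instead of raising an error. The sketch below uses a made-up frame; the column names `date`, `category`, `product`, and `sales` are assumptions about `long_data.csv`.

```python
import pandas as pd

# Hypothetical sales data with two 'toys' rows on the same date
sales = pd.DataFrame({
    'date': ['2024-01-01', '2024-01-01', '2024-01-01', '2024-01-02', '2024-01-02'],
    'category': ['toys', 'toys', 'books', 'toys', 'books'],
    'product': ['ball', 'kite', 'novel', 'ball', 'novel'],
    'sales': [10, 30, 5, 20, 15],
})

# Average sales per date and category (mean is also the default aggfunc)
avg = sales.pivot_table(index='date', columns='category', values='sales', aggfunc='mean')

# Total sales per date and category
total = sales.pivot_table(index='date', columns='category', values='sales', aggfunc='sum')

# Passing a list to columns= produces two-level (MultiIndex) column labels
by_product = sales.pivot_table(index='date', columns=['product', 'category'],
                               values='sales', aggfunc='sum')
print(avg, total, by_product, sep='\n\n')
```

Combinations with no rows, such as `kite` on the second date here, show up as NaN in the result.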


In [None]:
# Go from wide to long
wide_df.reset_index().melt(id_vars='date', var_name=['type','category'])

## Activity

In [None]:

# Create a DataFrame with data cleaning and reshaping opportunities
data = {
    'Pet Name': ['Fluffy', 'Whiskers', 'Bubbles', 'Spike', 'Coco', 'Maybelle', 'Snowball'],
    'Date Adopted': ['10-01-2023','03-04-2024','01-10-2024','02-14-2024','11-22-2023','01-04-2024','12-25-2025'],
    'Animal Type': ['Cat', 'Cat', 'Fish', 'Dog', 'Fish', 'Dog', 'Cat'],
    'Pet Age': ['3', '2', '13', '5', '4', '3', '2'],
    'Color': ['White', 'Gray', 'Orange', 'White', 'White', 'Black', 'Black'],
    'Happiness Level': ['High', 'Medium', 'High', 'Low', 'High', 'High', 'Medium']
}
df_pets = pd.DataFrame(data)
df_pets

**Activity 1:** Rename the columns of the pets dataframe to be in a better format.

**Activity 2:** Change any datatypes that should be adjusted.  

**Activity 3:** Practice pivoting the dataframe.