# Cleaning Data

<center><img src="https://cdn.shopify.com/s/files/1/0814/0441/files/lafufu_daobubu__1_vivian.jpg" width=30%></center>

Is this Labubu or Lafufu?

## Goals

* Explore the Popularity of Labubu vs Lafufu
* Implement Data Cleaning Techniques

## Import Pandas

Before we begin, import Pandas.

In [1]:
# Import Pandas
import pandas as pd

## Revisiting Labubu Popularity

Let's take a look at the Labubu search interest again.
Load the dataset in the cell below.

In [2]:
# Load Labubu Data
# Source: Google Trends
file = "../data/labubu_interest_by_region.csv"

labubu_df = pd.read_csv(file)

## View Information Before Organization

Getting descriptive information about the dataset will give us an idea of where to start the cleaning process.

Go ahead and do that in the cell below.

What do you notice? 

How might we go about cleaning the dataset?

In [3]:
# View Information
labubu_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250 entries, 0 to 249
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Region    250 non-null    object 
 1   Interest  59 non-null     float64
dtypes: float64(1), object(1)
memory usage: 4.0+ KB
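
The `info()` output shows far fewer non-null `Interest` values than rows. One way to put a number on that is to count missing values per column. The sketch below uses a small made-up DataFrame (`sample_df` and its values are illustrative, not the Google Trends data).

```python
import numpy as np
import pandas as pd

# Hypothetical sample that mimics the shape of the dataset (not real values)
sample_df = pd.DataFrame({
    "Region": ["A", "B", "C", "D"],
    "Interest": [100.0, np.nan, 42.0, np.nan],
})

# Number of missing values in each column
missing_counts = sample_df.isna().sum()

# Fraction of rows that are missing in each column
missing_share = sample_df.isna().mean()

print(missing_counts)
print(missing_share)
```

Running the same two calls on `labubu_df` should report how many `Interest` values are missing and what share of the rows that represents.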


## Removing NaN Values

Only 59 of the 250 rows have an `Interest` value; the rest are missing.
For the purposes of this exercise, let's go ahead and drop any null rows in the cell below.

In [5]:
## Remove NaN values from Labubu Data
labubu_df_clean = labubu_df.dropna()
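
A minimal sketch of what `dropna()` does, using a made-up DataFrame (`sample_df` is hypothetical). It shows that the original frame is left untouched and that the surviving rows keep their original index labels. The `subset` argument and the `reset_index` step are optional extras, not something this notebook relies on.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing Interest values
sample_df = pd.DataFrame({
    "Region": ["A", "B", "C", "D"],
    "Interest": [np.nan, 80.0, np.nan, 15.0],
})

# dropna() returns a new DataFrame; sample_df itself is unchanged.
# subset limits the null check to the listed columns.
dropped = sample_df.dropna(subset=["Interest"])

# Surviving rows keep their original index labels
print(dropped.index.tolist())

# Optional: renumber the rows starting from 0
renumbered = dropped.reset_index(drop=True)
print(renumbered.index.tolist())
```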

## View the Updated Information

In [6]:
## View New Information
labubu_df_clean.info()

<class 'pandas.core.frame.DataFrame'>
Index: 59 entries, 1 to 195
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Region    59 non-null     object 
 1   Interest  59 non-null     float64
dtypes: float64(1), object(1)
memory usage: 1.4+ KB


## Column Names

It may not seem like a big deal, but let's modify the column names.
Making them all lowercase keeps them consistent and easier to type when referenced.
We should also remove any leading or trailing whitespace.

In [9]:
## Fix up the Columns
labubu_df_clean.columns = labubu_df_clean.columns.str.strip().str.lower()
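
To see the effect of `str.strip()` and `str.lower()` on headers that are actually messy, here is a small sketch with a hypothetical DataFrame (`messy_df` and its column names are invented for illustration).

```python
import pandas as pd

# Hypothetical DataFrame with padded and upper-case headers
messy_df = pd.DataFrame({" Region ": ["A"], "INTEREST": [1.0]})

# strip() removes surrounding whitespace, lower() lowercases each name
messy_df.columns = messy_df.columns.str.strip().str.lower()

print(messy_df.columns.tolist())
```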

## Take a Look

Take a look at the updated columns in the cell below.

In [10]:
## Output Info...again
labubu_df_clean.info()

<class 'pandas.core.frame.DataFrame'>
Index: 59 entries, 1 to 195
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   region    59 non-null     object 
 1   interest  59 non-null     float64
dtypes: float64(1), object(1)
memory usage: 1.4+ KB


## View Statistical Data

Did our small operation affect anything about the data in a significant way?

Let's get statistical information on the original and cleaned versions of the dataframe.

In [11]:
## View Original Statistical Data
labubu_df.describe()

|       | Interest  |
|-------|-----------|
| count | 59.0      |
| mean  | 22.745763 |
| std   | 17.043143 |
| min   | 1.0       |
| 25%   | 13.5      |
| 50%   | 21.0      |
| 75%   | 29.0      |
| max   | 100.0     |


In [12]:
## View Cleaned Statistical Data
labubu_df_clean.describe()

|       | interest  |
|-------|-----------|
| count | 59.0      |
| mean  | 22.745763 |
| std   | 17.043143 |
| min   | 1.0       |
| 25%   | 13.5      |
| 50%   | 21.0      |
| 75%   | 29.0      |
| max   | 100.0     |


## Not Really...

The bit of cleaning we did had no effect on the statistics of our dataset. `describe()` already skips missing values, so dropping the null rows leaves the count, mean, and spread unchanged.
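
A short sketch of why the numbers match, using a hypothetical Series (`values` is made up): pandas reductions such as `mean()` skip `NaN` by default, and `describe()` counts only non-null entries.

```python
import numpy as np
import pandas as pd

# Hypothetical values with two missing entries
values = pd.Series([10.0, np.nan, 30.0, np.nan])

# mean() skips NaN by default (skipna=True)
mean_with_nan = values.mean()
mean_after_drop = values.dropna().mean()

# describe() reports the number of non-null values as "count"
count_with_nan = values.describe()["count"]

print(mean_with_nan, mean_after_drop, count_with_nan)
```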

Let's look at a more problematic dataset.

Lafufu enters the chat.

## Load Lafufu Data

In [13]:
# Load Lafufu Data
## Data Source: Google Trends

file = "../data/lafufu_interest_by_region.csv"

lafufu_df = pd.read_csv(file)

In [14]:
# View Information
lafufu_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250 entries, 0 to 249
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Region    250 non-null    object
 1   Interest  52 non-null     object
dtypes: object(2)
memory usage: 4.0+ KB


## Lower Case Column Names

Let's modify the column names in the cell below.

In [15]:
## Modify Column Names

# Making a copy 
lafufu_df_clean = lafufu_df.copy()

lafufu_df_clean.columns = lafufu_df_clean.columns.str.strip().str.lower()

## Remove NaN Values

In [17]:
## Remove NaN Values
lafufu_df_clean = lafufu_df_clean.dropna()

## View Updates

In [18]:
## View Information
lafufu_df_clean.info()

<class 'pandas.core.frame.DataFrame'>
Index: 52 entries, 0 to 95
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   region    52 non-null     object
 1   interest  52 non-null     object
dtypes: object(2)
memory usage: 1.2+ KB


## Tail

Look at the last few values. Notice anything?

In [20]:
## View the last rows
lafufu_df_clean.tail()

|    | region      | interest |
|----|-------------|----------|
| 75 | Taiwan      | 2        |
| 79 | Philippines | 1        |
| 85 | Mexico      | <1       |
| 94 | Indonesia   | <1       |
| 95 | Japan       | <1       |


## Get Statistical Data

In [21]:
## Statistical Data
lafufu_df_clean.describe()

|        | region  | interest |
|--------|---------|----------|
| count  | 52      | 52       |
| unique | 52      | 38       |
| top    | Czechia | <1       |
| freq   | 1       | 3        |


## Change Data Type

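The `interest` column should be a float, but the `<1` entries forced pandas to read it as `object` (text). Passing `errors="coerce"` converts any value that cannot be parsed as a number into `NaN`.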

__Syntax__:
```python
dataframe[column] = pd.to_numeric(dataframe[column], errors="coerce")
```

In [26]:
## Change Interest Column to Numeric Data Type
lafufu_df_clean["interest"] = pd.to_numeric(lafufu_df_clean["interest"],
                                           errors="coerce")
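
Coercing `<1` to `NaN` means those regions end up being dropped. The sketch below contrasts that with one possible alternative, substituting a small placeholder number before converting. The placeholder value `0.5` and the Series `raw` are assumptions made for illustration; this is not the approach the notebook takes.

```python
import pandas as pd

# Hypothetical raw column mixing numeric strings with a "<1" marker
raw = pd.Series(["2", "1", "<1"], name="interest")

# Option A (used in this notebook): unparseable values become NaN
coerced = pd.to_numeric(raw, errors="coerce")

# Option B (an assumption, not the notebook's approach):
# replace "<1" with a stand-in value, then convert
PLACEHOLDER = "0.5"  # hypothetical stand-in for "less than 1"
substituted = pd.to_numeric(raw.replace("<1", PLACEHOLDER))

print(coerced.tolist())
print(substituted.tolist())
```

Which option is appropriate depends on the question being asked: Option A keeps only measured values, while Option B keeps every region at the cost of inventing a number.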

In [27]:
## View Information
lafufu_df_clean.info()

<class 'pandas.core.frame.DataFrame'>
Index: 52 entries, 0 to 95
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   region    52 non-null     object 
 1   interest  49 non-null     float64
dtypes: float64(1), object(1)
memory usage: 1.2+ KB


## Check the Tail Again

In [28]:
## Check the Tail Again
lafufu_df_clean.tail()

|    | region      | interest |
|----|-------------|----------|
| 75 | Taiwan      | 2.0      |
| 79 | Philippines | 1.0      |
| 85 | Mexico      | NaN      |
| 94 | Indonesia   | NaN      |
| 95 | Japan       | NaN      |
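
To check exactly which raw values were turned into `NaN`, one option is to compare the original text column with its converted version. This is a sketch on a hypothetical Series (`interest` below is made up); on the real data the same idea would need to run against the column as it was before conversion.

```python
import pandas as pd

# Hypothetical column mixing numeric strings with a "<1" marker
interest = pd.Series(["2", "1", "<1", "<1"], name="interest")

# Convert, turning anything unparseable into NaN
converted = pd.to_numeric(interest, errors="coerce")

# Values that were present in the raw data but failed conversion
not_numeric = interest[converted.isna() & interest.notna()]

print(not_numeric.unique())
```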


## Drop the Remaining Rows Containing Null Values

In [29]:
## Drop the Last Null Rows
lafufu_df_clean = lafufu_df_clean.dropna()

# Display the Information
lafufu_df_clean.info()

<class 'pandas.core.frame.DataFrame'>
Index: 49 entries, 0 to 79
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   region    49 non-null     object 
 1   interest  49 non-null     float64
dtypes: float64(1), object(1)
memory usage: 1.1+ KB


## Check the Stats Again

In [31]:
## Display Statistical Information
lafufu_df_clean.describe()

|       | interest  |
|-------|-----------|
| count | 49.0      |
| mean  | 33.204082 |
| std   | 23.460054 |
| min   | 1.0       |
| 25%   | 14.0      |
| 50%   | 32.0      |
| 75%   | 45.0      |
| max   | 100.0     |


## Where is Lafufu Most Popular by Google Search?

In [33]:
## Display Top 10 Lafufu Regions
lafufu_df_clean.sort_values(by="interest",
                            ascending=False).head(10)

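|    | region               | interest |
|----|----------------------|----------|
| 0  | Czechia              | 100.0    |
| 2  | Sweden               | 78.0     |
| 3  | Serbia               | 77.0     |
| 4  | Bosnia & Herzegovina | 70.0     |
| 7  | Croatia              | 63.0     |
| 8  | Norway               | 63.0     |
| 9  | Slovakia             | 62.0     |
| 10 | Poland               | 61.0     |
| 11 | Netherlands          | 61.0     |
| 12 | Estonia              | 60.0     |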
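
As a side note, `DataFrame.nlargest` is a shorter way to express "sort descending, then take the top rows". The sketch below compares the two on a hypothetical DataFrame (`sample_df` and its values are made up).

```python
import pandas as pd

# Hypothetical cleaned data
sample_df = pd.DataFrame({
    "region": ["A", "B", "C", "D"],
    "interest": [10.0, 100.0, 55.0, 70.0],
})

# Approach used above: sort descending, then take the first rows
top_sorted = sample_df.sort_values(by="interest", ascending=False).head(2)

# Alternative: ask directly for the rows with the largest values
top_nlargest = sample_df.nlargest(2, "interest")

print(top_nlargest)
```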
