# Data types
## Prepare and clean data

- There may be times we want to convert from one type to another
    - Numeric columns can be strings, or vice versa
    
## Categorical data
- Converting categorical data to the ‘category’ dtype:
    - Can make the DataFrame smaller in memory
    - Can make the columns usable by other Python libraries for analysis

### Converting data types
In this exercise, you'll see how ensuring all categorical variables in a DataFrame are of type `category` **reduces memory usage.**

The `tips` dataset has been loaded into a DataFrame called `tips`. This data contains information about how much a customer tipped, whether the customer was male or female, a smoker or not, etc.

Look at the output of `tips.info()`. You'll note that two columns that should be categorical - `sex` and `smoker` - are instead of type `object`, which is pandas' way of storing arbitrary strings. Your job is to convert these two columns to type `category` and note the reduced memory usage.

In [1]:
import pandas as pd

In [2]:
tips = pd.read_csv('../_datasets/tips.csv')
tips.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [3]:
tips.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
total_bill    244 non-null float64
tip           244 non-null float64
sex           244 non-null object
smoker        244 non-null object
day           244 non-null object
time          244 non-null object
size          244 non-null int64
dtypes: float64(2), int64(1), object(4)
memory usage: 9.6+ KB


In [4]:
# Convert the sex column to type 'category'
tips.sex = tips.sex.astype('category')

# Convert the smoker column to type 'category'
tips.smoker = tips.smoker.astype('category')

# Print the info of tips
tips.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
total_bill    244 non-null float64
tip           244 non-null float64
sex           244 non-null category
smoker        244 non-null category
day           244 non-null object
time          244 non-null object
size          244 non-null int64
dtypes: category(2), float64(2), int64(1), object(2)
memory usage: 8.2+ KB


By converting `sex` and `smoker` to categorical variables, the memory usage of the DataFrame **went down from 9.6 KB to 8.2 KB.** This may seem like a small difference here, but **_when you're dealing with large datasets, the reduction in memory usage can be very significant!_**
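
The `+` in `memory usage: 9.6+ KB` means that `info()` reports a lower bound, because the contents of `object` columns are not counted. The sketch below measures the effect more precisely with `memory_usage(deep=True)`. The DataFrame `df` is a hypothetical stand-in for `tips`, built only for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the tips data: two low-cardinality string columns
df = pd.DataFrame({
    'sex': ['Female', 'Male'] * 5000,
    'smoker': ['No', 'Yes'] * 5000,
})

# deep=True also counts the memory held by the string values themselves
before = df.memory_usage(deep=True).sum()

# Convert both columns to 'category' in a single call
df_cat = df.astype({'sex': 'category', 'smoker': 'category'})
after = df_cat.memory_usage(deep=True).sum()

print(before, after)
```

The saving comes from storing each distinct string once and keeping only small integer codes per row, so it grows with the number of rows.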

### Working with numeric data
If you expect the data type of a column to be numeric (`int` or `float`), but instead it is of type `object`, this typically means that there is a non-numeric value in the column, which also signifies bad data.

You can use the `pd.to_numeric()` function to convert a column into a numeric data type. **If the function raises an error, you can be sure that there is a bad value within the column**. You can either use the techniques you learned in Chapter 1 to do some exploratory data analysis and find the bad value, or you can choose to ignore or `coerce` the value into a missing value, `NaN`.

A modified version of the tips dataset has been loaded into a DataFrame called `tips_bad`. For instructional purposes, it has been pre-processed to introduce some 'bad' data for you to clean. Use the `.info()` method to explore this. You'll note that the `total_bill` and `tip` columns, which should be numeric, are instead of type `object`. Your job is to fix this.

In [5]:
tips_bad = pd.read_csv('../_datasets/tips_bad.csv')
tips_bad.info()
tips_bad.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
total_bill    244 non-null object
tip           244 non-null object
sex           244 non-null object
smoker        244 non-null object
day           244 non-null object
time          244 non-null object
size          244 non-null int64
dtypes: int64(1), object(6)
memory usage: 7.7+ KB


Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,-,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [6]:
# Convert 'total_bill' to a numeric dtype
tips_bad['total_bill'] = pd.to_numeric(tips_bad['total_bill'], errors='coerce')

# Convert 'tip' to a numeric dtype
tips_bad['tip'] = pd.to_numeric(tips_bad['tip'], errors='coerce')

# Print the info of tips
print(tips_bad.info())
tips_bad.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
total_bill    243 non-null float64
tip           243 non-null float64
sex           244 non-null object
smoker        244 non-null object
day           244 non-null object
time          244 non-null object
size          244 non-null int64
dtypes: float64(2), int64(1), object(4)
memory usage: 9.6+ KB
None


Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


The `'total_bill'` and `'tip'` columns in this DataFrame were stored as `object` because they contain bad data. Coercing the columns to a numeric type turns the bad values into proper `NaN` values.
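
Before discarding bad values, it can help to see what they were. The following is a minimal sketch, assuming a small hypothetical Series `raw` that mimics the `'-'` placeholder shown above. It contrasts the default `errors='raise'` behavior with `errors='coerce'` and then lists the entries that were coerced.

```python
import pandas as pd

# Hypothetical column with one placeholder value, mimicking the '-' seen above
raw = pd.Series(['-', '10.34', '21.01'])

# With the default errors='raise', a bad value triggers a ValueError
try:
    pd.to_numeric(raw)
    raised = False
except ValueError:
    raised = True

# With errors='coerce', entries that cannot be parsed become NaN
coerced = pd.to_numeric(raw, errors='coerce')

# Entries that were present before but are NaN afterwards are the bad values
bad_values = raw[coerced.isna() & raw.notna()]
print(bad_values.tolist())  # ['-']
```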

# Using regular expressions to clean strings
## String manipulation
- Much of data cleaning involves string manipulation
- Most of the world’s data is unstructured text
- Also have to do string manipulation to make datasets consistent with one another

- Many built-in and external libraries
- ‘re’ library for regular expressions
    - A formal way of specifying a pattern
    - Sequence of characters
- Pattern matching
    - Similar to globbing

## Using regular expressions
- Compile the pattern
- Use the compiled pattern to match values
- This lets us use the pattern over and over again
- Useful since we want to match values down a column of values

### String parsing with regular expressions
In the video, Dan introduced you to the basics of regular expressions, which are powerful ways of defining patterns to match strings. This exercise will get you started with writing them.

When working with data, it is sometimes necessary to write a regular expression to look for properly entered values. Phone numbers are a common field that needs to be checked for validity. Your job in this exercise is to define a regular expression to match US phone numbers that fit the pattern of `xxx-xxx-xxxx`.

The [regular expression module][1] in Python is `re`. Because the same pattern will be matched against many rows, it's better to compile the pattern once using `re.compile()`, and then reuse the compiled pattern to match values.

[1]:https://docs.python.org/3/library/re.html

In [7]:
# Import the regular expression module
import re

# Compile the pattern: prog
prog = re.compile(r'\d{3}-\d{3}-\d{4}')

In [8]:
# See if the pattern matches
result = prog.match('123-456-7890')
print(bool(result))

True


In [9]:
# See if the pattern matches
result2 = prog.match('1123-456-7890')
print(bool(result2))

False
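
The compiled pattern can be reused down a whole column, as described above. The sketch below assumes a hypothetical Series `phones`. It also shows that `.match()` only anchors at the start of the string, so `.fullmatch()` is the stricter check when trailing characters should be rejected.

```python
import re
import pandas as pd

# Same phone pattern as above, written as a raw string
prog = re.compile(r'\d{3}-\d{3}-\d{4}')

# Hypothetical column of phone numbers
phones = pd.Series(['123-456-7890', '1123-456-7890', '123-456-78901'])

# .match() only anchors at the start, so extra trailing digits are accepted
starts_ok = phones.apply(lambda s: bool(prog.match(s)))

# .fullmatch() requires the whole string to fit the pattern
whole_ok = phones.apply(lambda s: bool(prog.fullmatch(s)))

print(starts_ok.tolist())  # [True, False, True]
print(whole_ok.tolist())   # [True, False, False]
```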


### Extracting numerical values from strings
Extracting numbers from strings is a common task, particularly when working with unstructured data or log files.

Say you have the following string: `'the recipe calls for 6 strawberries and 2 bananas'`.

It would be useful to extract the `6` and the `2` from this string to be saved for later use when comparing strawberry to banana ratios.

When using a regular expression to extract multiple numbers (or multiple pattern matches, to be exact), you can use the `re.findall()` function. Dan did not discuss this in the video, but it is straightforward to use: You pass in a pattern and a string to `re.findall()`, and it will return a list of the matches.

- Write a pattern that will find all the numbers in the following string: `'the recipe calls for 10 strawberries and 1 banana'`. To do this:
    - Use the `re.findall()` function and pass it two arguments: the pattern, followed by the string.
    - `\d` is the pattern required to find digits. This should be followed with a `+` so that the previous element is matched one or more times. *This ensures that 10 is viewed as one number and not as 1 and 0.*
    - Print the matches to confirm that your regular expression found the values 10 and 1.

In [10]:
# Import the regular expression module
import re

# Find the numeric values: matches
matches = re.findall(r'\d+', 'the recipe calls for 10 strawberries and 1 banana')

# Print the matches
print(matches)

['10', '1']
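
Note that `re.findall()` returns strings, not numbers. As a small sketch of the strawberry-to-banana comparison mentioned earlier, the matches can be converted with `int()` before doing arithmetic. The variable names here are illustrative only.

```python
import re

# Extract the digit runs from the first example string
matches = re.findall(r'\d+', 'the recipe calls for 6 strawberries and 2 bananas')

# findall returns strings, so convert them before computing with them
strawberries, bananas = (int(m) for m in matches)
ratio = strawberries / bananas

print(matches)  # ['6', '2']
print(ratio)    # 3.0
```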


### Pattern matching
In this exercise, you'll continue practicing your regular expression skills. For each provided string, your job is to write the appropriate pattern to match it.

Write patterns to match:
- A telephone number of the format xxx-xxx-xxxx. You already did this in a previous exercise.
- A string of the format: A dollar sign, an arbitrary number of digits, a decimal point, 2 digits.
    - Use `\$` to match the dollar sign, 
    - Use `\d*` to match an arbitrary number of digits, 
    - Use `\.` to match the decimal point, 
    - and use `\d{x}` to match `x` number of digits.
- A capital letter, followed by an arbitrary number of alphanumeric characters.
    - Use `[A-Z]` to match any capital letter 
    - followed by `\w*` to match an arbitrary number of alphanumeric characters.

In [11]:
# Write the first pattern
pattern1 = bool(re.match(pattern=r'\d{3}-\d{3}-\d{4}', string='123-456-7890'))
print(pattern1)

True


In [12]:
# Write the second pattern
pattern2 = bool(re.match(pattern=r'\$\d*\.\d{2}', string='$123.45'))
print(pattern2)

True


In [13]:
# Write the third pattern
pattern3 = bool(re.match(pattern=r'[A-Z]\w*', string='Australia'))
print(pattern3)

True
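
It is worth probing patterns with strings that should not match as well. The sketch below reuses the second and third patterns on a few hypothetical inputs to show two edge cases: `\d*` accepts zero digits, and `\w` also accepts the underscore.

```python
import re

dollars = re.compile(r'\$\d*\.\d{2}')
capital = re.compile(r'[A-Z]\w*')

# \d* matches zero or more digits, so a bare '$.45' is accepted
no_digits = bool(dollars.fullmatch('$.45'))

# With fullmatch, a third decimal digit makes the string fail
three_decimals = bool(dollars.fullmatch('$123.456'))

# [A-Z] rejects a lowercase first letter
lowercase = bool(capital.match('australia'))

# \w covers letters, digits and the underscore
underscore = bool(capital.fullmatch('New_Zealand'))

print(no_digits, three_decimals, lowercase, underscore)  # True False False True
```

If at least one digit is required before the decimal point, `\d+` could be used in place of `\d*`.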


# Using functions to clean data
## Complex cleaning
- Cleaning step requires multiple steps
    - Extract number from string
    - Perform transformation on extracted number
- Python function

### Custom functions to clean data
You'll now practice writing functions to clean data.

The `tips` dataset was loaded earlier into a DataFrame called `tips`. It has a `'sex'` column that contains the values `'Male'` or `'Female'`. Your job is to write a function that will recode `'Female'` to `0`, `'Male'` to `1`, and return `np.nan` for all entries of `'sex'` that are neither `'Female'` nor `'Male'`.

_**Recoding variables like this is a common data cleaning task**_. Functions provide a mechanism for you to abstract away complex bits of code as well as reuse code. This makes your code more readable and less error prone.

As Dan showed you in the videos, you can use the `.apply()` method to apply a function across entire rows or columns of DataFrames. However, note that each column of a DataFrame is a pandas Series. Functions can also be applied across Series. Here, you will apply your function over the `'sex'` column.

In [14]:
# Import numpy for np.nan
import numpy as np

# Define recode_gender()
def recode_gender(gender):

    # Return 0 if gender is 'Female'
    if gender == 'Female':
        return 0

    # Return 1 if gender is 'Male'
    elif gender == 'Male':
        return 1

    # Return np.nan for any other value
    else:
        return np.nan

In [15]:
# Apply the function to the sex column
tips['recode'] = tips.sex.apply(recode_gender)

# Print the first five rows of tips
tips.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,recode
0,16.99,1.01,Female,No,Sun,Dinner,2,0
1,10.34,1.66,Male,No,Sun,Dinner,3,1
2,21.01,3.5,Male,No,Sun,Dinner,3,1
3,23.68,3.31,Male,No,Sun,Dinner,2,1
4,24.59,3.61,Female,No,Sun,Dinner,4,0
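
For a simple one-to-one recoding like this, a dictionary passed to `Series.map()` is a compact alternative to a custom function. This is a sketch on a hypothetical Series `sex`, not the approach used in the exercise. Values that are missing from the dictionary become `NaN`.

```python
import pandas as pd

# Hypothetical column with one unexpected value
sex = pd.Series(['Female', 'Male', 'Unknown'])

# Values not found in the mapping are returned as NaN
recoded = sex.map({'Female': 0, 'Male': 1})

print(recoded.tolist())  # [0.0, 1.0, nan]
```

Because `NaN` is a float, the resulting column has a float dtype rather than an integer one.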


### Lambda functions
You'll now be introduced to a powerful Python feature that will help you clean your data more effectively: lambda functions. Instead of using the `def` syntax that you used in the previous exercise, lambda functions let you make simple, one-line functions.

For example, here's a function that squares a variable used in an `.apply()` method:

``` Python
def my_square(x):
    return x ** 2
df.apply(my_square)
```
The equivalent code using a lambda function is:

``` Python
df.apply(lambda x: x ** 2)
```

The lambda function takes one parameter - the variable `x`. The function itself just squares `x` and returns the result, which is whatever the one line of code evaluates to. In this way, lambda functions can make your code concise and Pythonic.

The tips dataset has been pre-loaded into a DataFrame called `tips`. Your job is to clean its `'total_dollar'` column by removing the dollar sign. You'll do this using two different methods: With the `.replace()` method, and with regular expressions.

In [16]:
tips['total_dollar'] = '$'+tips.total_bill.apply(lambda x: str(x))

In [17]:
# Write the lambda function using replace
tips['total_dollar_replace'] = tips.total_dollar.apply(lambda x: x.replace('$', ''))

# Write the lambda function using regular expressions
tips['total_dollar_re'] = tips.total_dollar.apply(lambda x: re.findall(r'\d+\.\d+', x)[0])

# Print the head of tips
tips.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,recode,total_dollar,total_dollar_replace,total_dollar_re
0,16.99,1.01,Female,No,Sun,Dinner,2,0,$16.99,16.99,16.99
1,10.34,1.66,Male,No,Sun,Dinner,3,1,$10.34,10.34,10.34
2,21.01,3.5,Male,No,Sun,Dinner,3,1,$21.01,21.01,21.01
3,23.68,3.31,Male,No,Sun,Dinner,2,1,$23.68,23.68,23.68
4,24.59,3.61,Female,No,Sun,Dinner,4,0,$24.59,24.59,24.59
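
Both cleaned columns above still hold strings. As a sketch of an alternative, the pandas `.str` accessor can do the same cleaning without `.apply()`, and `.astype(float)` then makes the result numeric. The Series `total_dollar` below is a hypothetical stand-in for the `tips` column.

```python
import pandas as pd

# Hypothetical stand-in for tips.total_dollar
total_dollar = pd.Series(['$16.99', '$10.34', '$21.01'])

# Literal replacement of the dollar sign (regex=False treats '$' as plain text)
cleaned = total_dollar.str.replace('$', '', regex=False).astype(float)

# Regular-expression extraction of the first decimal number
extracted = total_dollar.str.extract(r'(\d+\.\d+)', expand=False).astype(float)

print(cleaned.tolist())  # [16.99, 10.34, 21.01]
```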


# Duplicate and missing data
## Duplicate data
- Can skew results
- ‘.drop_duplicates()’ method
## Missing data
- Leave as-is
- Drop them
- Fill missing value
## Fill missing values with .fillna()
- Fill with provided value
- Use a summary statistic
## Fill missing values with a summary statistic
- Careful when using summary statistics to fill
- Have to make sure the value you are filling in makes sense
- Median is a better statistic in the presence of outliers

### Dropping duplicate data
Duplicate data causes a variety of problems. From the point of view of performance, it uses up unnecessary memory and causes unneeded calculations to be performed when processing data. In addition, it can bias any analysis results.

A dataset consisting of the performance of songs on the Billboard charts has been pre-loaded into a DataFrame called `billboard`. Your job in this exercise is to subset this DataFrame and then drop all duplicate rows.

In [18]:
billboard = pd.read_csv('../_datasets/billboard.csv', encoding = 'unicode_escape')
print(billboard.shape)
billboard.head()

(317, 83)


Unnamed: 0,year,artist.inverted,track,time,genre,date.entered,date.peaked,x1st.week,x2nd.week,x3rd.week,...,x67th.week,x68th.week,x69th.week,x70th.week,x71st.week,x72nd.week,x73rd.week,x74th.week,x75th.week,x76th.week
0,2000,Destiny's Child,Independent Women Part I,3:38,Rock,2000-09-23,2000-11-18,78,63.0,49.0,...,,,,,,,,,,
1,2000,Santana,"Maria, Maria",4:18,Rock,2000-02-12,2000-04-08,15,8.0,6.0,...,,,,,,,,,,
2,2000,Savage Garden,I Knew I Loved You,4:07,Rock,1999-10-23,2000-01-29,71,48.0,43.0,...,,,,,,,,,,
3,2000,Madonna,Music,3:45,Rock,2000-08-12,2000-09-16,41,23.0,18.0,...,,,,,,,,,,
4,2000,"Aguilera, Christina",Come On Over Baby (All I Want Is You),3:38,Rock,2000-08-05,2000-10-14,57,47.0,45.0,...,,,,,,,,,,


In [19]:
# Melt billboard: 
billboard = pd.melt(billboard, id_vars=['year','artist.inverted','track','time','genre','date.entered','date.peaked'], var_name='week', value_name='rank')

print(billboard.shape)

# Print the head of billboard
billboard.head()

(24092, 9)


Unnamed: 0,year,artist.inverted,track,time,genre,date.entered,date.peaked,week,rank
0,2000,Destiny's Child,Independent Women Part I,3:38,Rock,2000-09-23,2000-11-18,x1st.week,78.0
1,2000,Santana,"Maria, Maria",4:18,Rock,2000-02-12,2000-04-08,x1st.week,15.0
2,2000,Savage Garden,I Knew I Loved You,4:07,Rock,1999-10-23,2000-01-29,x1st.week,71.0
3,2000,Madonna,Music,3:45,Rock,2000-08-12,2000-09-16,x1st.week,41.0
4,2000,"Aguilera, Christina",Come On Over Baby (All I Want Is You),3:38,Rock,2000-08-05,2000-10-14,x1st.week,57.0


In [20]:
# Create the new DataFrame: tracks
tracks = billboard[['year', 'artist.inverted', 'track', 'time']]

# Print info of tracks
print(tracks.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24092 entries, 0 to 24091
Data columns (total 4 columns):
year               24092 non-null int64
artist.inverted    24092 non-null object
track              24092 non-null object
time               24092 non-null object
dtypes: int64(1), object(3)
memory usage: 470.6+ KB
None


In [21]:
# Drop the duplicates: tracks_no_duplicates
tracks_no_duplicates = tracks.drop_duplicates()

# Print info of tracks
print(tracks_no_duplicates.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 317 entries, 0 to 316
Data columns (total 4 columns):
year               317 non-null int64
artist.inverted    317 non-null object
track              317 non-null object
time               317 non-null object
dtypes: int64(1), object(3)
memory usage: 8.7+ KB
None
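
When only some columns define what counts as a duplicate, `.duplicated()` and `.drop_duplicates()` both accept a `subset` argument. The sketch below uses a small hypothetical DataFrame with simplified column names to count the duplicates first and then drop them.

```python
import pandas as pd

# Hypothetical melted data: each track appears once per week
tracks = pd.DataFrame({
    'artist': ['Santana', 'Santana', 'Madonna', 'Madonna'],
    'track': ['Maria, Maria', 'Maria, Maria', 'Music', 'Music'],
    'week': ['x1st.week', 'x2nd.week', 'x1st.week', 'x2nd.week'],
})

# Count rows that repeat an earlier (artist, track) pair
n_dupes = tracks.duplicated(subset=['artist', 'track']).sum()

# Keep the first occurrence of each pair and renumber the index
unique_tracks = (
    tracks.drop_duplicates(subset=['artist', 'track'])
          .reset_index(drop=True)
)

print(n_dupes)             # 2
print(len(unique_tracks))  # 2
```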


### Filling missing data
Here, you'll return to the `airquality` dataset from Chapter 2. It has been pre-loaded into the DataFrame `airquality`, and it has missing values for you to practice filling in. 

It's rare to have a (real-world) dataset without any missing values, and it's important to deal with them because certain calculations cannot handle missing values while some calculations will, by default, skip over any missing values.

Also, understanding how much missing data you have and thinking about where it comes from are crucial to making unbiased interpretations of data.

In [22]:
airquality = pd.read_csv("../_datasets/airquality.csv")
print(airquality.info())
airquality.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 153 entries, 0 to 152
Data columns (total 6 columns):
Ozone      116 non-null float64
Solar.R    146 non-null float64
Wind       153 non-null float64
Temp       153 non-null int64
Month      153 non-null int64
Day        153 non-null int64
dtypes: float64(3), int64(3)
memory usage: 7.2 KB
None


Unnamed: 0,Ozone,Solar.R,Wind,Temp,Month,Day
0,41.0,190.0,7.4,67,5,1
1,36.0,118.0,8.0,72,5,2
2,12.0,149.0,12.6,74,5,3
3,18.0,313.0,11.5,62,5,4
4,,,14.3,56,5,5


In [23]:
# Calculate the mean of the Ozone column: oz_mean
oz_mean = airquality.Ozone.mean()

# Replace all the missing values in the Ozone column with the mean
airquality['Ozone'] = airquality.Ozone.fillna(oz_mean)

# Print the info of airquality
print(airquality.info())

airquality.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 153 entries, 0 to 152
Data columns (total 6 columns):
Ozone      153 non-null float64
Solar.R    146 non-null float64
Wind       153 non-null float64
Temp       153 non-null int64
Month      153 non-null int64
Day        153 non-null int64
dtypes: float64(3), int64(3)
memory usage: 7.2 KB
None


Unnamed: 0,Ozone,Solar.R,Wind,Temp,Month,Day
0,41.0,190.0,7.4,67,5,1
1,36.0,118.0,8.0,72,5,2
2,12.0,149.0,12.6,74,5,3
3,18.0,313.0,11.5,62,5,4
4,42.12931,,14.3,56,5,5
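
As noted earlier, the median can be a better fill value than the mean when outliers are present. The sketch below illustrates this on a hypothetical Series `ozone` with one extreme reading. The numbers are made up for illustration and are not taken from `airquality`.

```python
import numpy as np
import pandas as pd

# Hypothetical readings with one outlier and one missing value
ozone = pd.Series([10.0, 12.0, 14.0, 200.0, np.nan])

# Both statistics skip NaN by default
mean_value = ozone.mean()      # (10 + 12 + 14 + 200) / 4 = 59.0
median_value = ozone.median()  # (12 + 14) / 2 = 13.0

filled_mean = ozone.fillna(mean_value)
filled_median = ozone.fillna(median_value)

print(filled_mean.iloc[-1], filled_median.iloc[-1])  # 59.0 13.0
```

The mean is pulled far above the typical readings by the single outlier, while the median stays close to them.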


# Testing with asserts
## Assert statements
- Programmatically vs visually checking
- If we drop or fill NaNs, we expect 0 missing values
- We can write an assert statement to verify this
- We can detect early warnings and errors
- This gives us confidence that our code is running correctly

### Testing your data with asserts
Here, you'll practice writing assert statements using the Ebola dataset from previous chapters to programmatically check for missing values and to confirm that all values are positive. The dataset has been pre-loaded into a DataFrame called `ebola`.

In the video, you saw Dan use the `.all()` method together with the `.notnull()` DataFrame method to check for missing values in a column. The `.all()` method returns `True` if all values are `True`. When used on a DataFrame, it returns a Series of Booleans - one for each column in the DataFrame. So if you are using it on a DataFrame, like in this exercise, you need to chain another `.all()` method so that you return only one `True` or `False` value. When using these within an `assert` statement, nothing is printed if the asserted condition is true, and an `AssertionError` is raised if it is false. This is how you can confirm that the data you are checking are valid.

Note: You can use `pd.notnull(df)` as an alternative to `df.notnull()`.

- Write an assert statement to confirm that there are no missing values in `ebola`.
    - Use the `pd.notnull()` function on `ebola` (or the `.notnull()` method of `ebola`) and chain two `.all()` methods (that is, `.all().all()`). The first `.all()` method will return a `True` or `False` for each column, while the second `.all()` method will return a single `True` or `False`.
- Write an assert statement to confirm that all values in `ebola` are greater than or equal to `0`.
    - Chain two `.all()` methods to the Boolean condition (`ebola >= 0`).
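
The difference between one `.all()` and two can be seen on a tiny example. The DataFrame `df` below is hypothetical and exists only to show the shape of each result.

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame with a single missing value in column 'b'
df = pd.DataFrame({'a': [1.0, 2.0], 'b': [3.0, np.nan]})

# One .all() gives one Boolean per column
per_column = df.notnull().all()

# A second .all() reduces those to a single Boolean
overall = df.notnull().all().all()

print(per_column.tolist())  # [True, False]
print(overall)              # False

# After filling, the assertion passes silently
df_filled = df.fillna(0)
assert df_filled.notnull().all().all()
```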

In [24]:
ebola = pd.read_csv('../_datasets/ebola.csv')
print(ebola.shape)
ebola.head()

(122, 18)


Unnamed: 0,Date,Day,Cases_Guinea,Cases_Liberia,Cases_SierraLeone,Cases_Nigeria,Cases_Senegal,Cases_UnitedStates,Cases_Spain,Cases_Mali,Deaths_Guinea,Deaths_Liberia,Deaths_SierraLeone,Deaths_Nigeria,Deaths_Senegal,Deaths_UnitedStates,Deaths_Spain,Deaths_Mali
0,1/5/2015,289,2776.0,,10030.0,,,,,,1786.0,,2977.0,,,,,
1,1/4/2015,288,2775.0,,9780.0,,,,,,1781.0,,2943.0,,,,,
2,1/3/2015,287,2769.0,8166.0,9722.0,,,,,,1767.0,3496.0,2915.0,,,,,
3,1/2/2015,286,,8157.0,,,,,,,,3496.0,,,,,,
4,12/31/2014,284,2730.0,8115.0,9633.0,,,,,,1739.0,3471.0,2827.0,,,,,


In [25]:
# Fill all the NaN values with 0
ebola = ebola.fillna(0)
ebola.head()

Unnamed: 0,Date,Day,Cases_Guinea,Cases_Liberia,Cases_SierraLeone,Cases_Nigeria,Cases_Senegal,Cases_UnitedStates,Cases_Spain,Cases_Mali,Deaths_Guinea,Deaths_Liberia,Deaths_SierraLeone,Deaths_Nigeria,Deaths_Senegal,Deaths_UnitedStates,Deaths_Spain,Deaths_Mali
0,1/5/2015,289,2776.0,0.0,10030.0,0.0,0.0,0.0,0.0,0.0,1786.0,0.0,2977.0,0.0,0.0,0.0,0.0,0.0
1,1/4/2015,288,2775.0,0.0,9780.0,0.0,0.0,0.0,0.0,0.0,1781.0,0.0,2943.0,0.0,0.0,0.0,0.0,0.0
2,1/3/2015,287,2769.0,8166.0,9722.0,0.0,0.0,0.0,0.0,0.0,1767.0,3496.0,2915.0,0.0,0.0,0.0,0.0,0.0
3,1/2/2015,286,0.0,8157.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3496.0,0.0,0.0,0.0,0.0,0.0,0.0
4,12/31/2014,284,2730.0,8115.0,9633.0,0.0,0.0,0.0,0.0,0.0,1739.0,3471.0,2827.0,0.0,0.0,0.0,0.0,0.0


In [26]:
# Assert that there are no missing values
assert pd.notnull(ebola).all().all()

In [27]:
# Assert that all values are >= 0
assert (ebola >= 0).all().all()
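
The `ebola` DataFrame also contains a `Date` column of strings. Depending on the pandas version, comparing such a column with `0` may raise a `TypeError` instead of returning Booleans. One way around this, sketched below on a hypothetical miniature of the data, is to restrict the check to numeric columns with `select_dtypes()` and to give the assert a message that explains a failure.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the ebola data
ebola_like = pd.DataFrame({
    'Date': ['1/5/2015', '1/4/2015'],
    'Day': [289, 288],
    'Cases_Guinea': [2776.0, np.nan],
})
ebola_like['Cases_Guinea'] = ebola_like['Cases_Guinea'].fillna(0)

# Only numeric columns can be meaningfully compared with 0
numeric = ebola_like.select_dtypes(include='number')

# The optional message is shown if the assertion fails
assert (numeric >= 0).all().all(), 'found negative values in numeric columns'

print(numeric.columns.tolist())  # ['Day', 'Cases_Guinea']
```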