### Converting data types

In this exercise, you'll see how ensuring all categorical variables in a DataFrame are of type category reduces memory usage.
The tips dataset contains information about how much a customer tipped, whether the customer was male or female, a smoker or not, etc.

You'll note that two columns that should be categorical - sex and smoker - are instead of type object, which is pandas' way of storing arbitrary strings. Your job is to convert these two columns to type category and note the reduced memory usage.

In [1]:
import pandas as pd

In [2]:
tips = pd.read_csv('Data/tips.csv', index_col=0)

print(tips.info())
tips.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 244 entries, 0 to 243
Data columns (total 7 columns):
total_bill    244 non-null float64
tip           244 non-null float64
sex           244 non-null object
smoker        244 non-null object
day           244 non-null object
time          244 non-null object
size          244 non-null int64
dtypes: float64(2), int64(1), object(4)
memory usage: 15.2+ KB
None


Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [3]:
# Convert the sex column to type 'category'
tips.sex = tips.sex.astype('category')

# Convert the smoker column to type 'category'
tips.smoker = tips.smoker.astype('category')

# Print the info of tips
print(tips.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 244 entries, 0 to 243
Data columns (total 7 columns):
total_bill    244 non-null float64
tip           244 non-null float64
sex           244 non-null category
smoker        244 non-null category
day           244 non-null object
time          244 non-null object
size          244 non-null int64
dtypes: category(2), float64(2), int64(1), object(2)
memory usage: 12.1+ KB
None
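
The saving grows with the number of rows, because a category column stores each distinct label once plus a small integer code per row. A minimal sketch of that effect, assuming a made-up column (`sex` here is a synthetic Series, not the tips data):

```python
import pandas as pd

# Synthetic column (an assumption for illustration): two labels repeated many times
sex = pd.Series(["Male", "Female"] * 5000, dtype=object)

# deep=True also counts the Python string objects, not just the pointers to them
object_bytes = sex.memory_usage(deep=True)
category_bytes = sex.astype("category").memory_usage(deep=True)

print(f"object:   {object_bytes} bytes")
print(f"category: {category_bytes} bytes")
```

The `+` after the memory figure in the `.info()` output indicates that the object columns were not measured deeply, so the real difference is larger than the printed numbers suggest.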


### Working with numeric data
If you expect a column to be numeric (int or float) but it is of type object instead, this
typically means that the column contains at least one non-numeric value, which is often a sign
of bad data.
You can use the `pd.to_numeric()` function to convert a column into a numeric data type. If the
function raises an error, you can be sure that there is a bad value within the column.

Instead of letting the error propagate, you can use the `errors` argument to ignore the bad value
or to coerce it into a missing value, NaN.

Use the `.info()` method to explore the data. You'll note that the total_bill and tip columns,
which should be numeric, are instead of type object. Your job is to fix this.

In [4]:
# Read file into a DataFrame: tips_2
tips_2 = pd.read_csv("Data/tips_2.csv", index_col=0)

print(tips_2.shape)

# Print the head of the DataFrame
tips_2.head()

(244, 7)


Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,,2.0
1,10.34,1.66,Male,No,Sun,Dinner,3.0
2,21.01,3.5,Male,No,Sun,Dinner,3.0
3,23.68,3.31,Male,No,Sun,Dinner,2.0
4,missing,3.61,Female,No,Sun,Dinner,4.0


In [5]:
tips_2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 244 entries, 0 to 243
Data columns (total 7 columns):
total_bill    244 non-null object
tip           244 non-null object
sex           234 non-null object
smoker        229 non-null object
day           243 non-null object
time          227 non-null object
size          231 non-null float64
dtypes: float64(1), object(6)
memory usage: 15.2+ KB


In [6]:
# Convert 'total_bill' to a numeric dtype
tips_2['total_bill'] = pd.to_numeric(tips_2['total_bill'], errors='coerce')

# Convert 'tip' to a numeric dtype
tips_2['tip'] = pd.to_numeric(tips_2['tip'], errors='coerce')

# Print the info of tips_2
print(tips_2.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 244 entries, 0 to 243
Data columns (total 7 columns):
total_bill    202 non-null float64
tip           220 non-null float64
sex           234 non-null object
smoker        229 non-null object
day           243 non-null object
time          227 non-null object
size          231 non-null float64
dtypes: float64(3), object(4)
memory usage: 15.2+ KB
None


* The 'total_bill' and 'tip' columns in this DataFrame are stored as object types because the string 'missing' is used in these columns to encode missing values. Coercing the columns to a numeric type turns those strings into proper NaN values.
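
The difference between the default behaviour and `errors='coerce'` can be seen on a small example. This is a sketch on a made-up Series (`raw` is hypothetical, not the tips_2 data):

```python
import pandas as pd

# Hypothetical column in which 'missing' encodes a missing value
raw = pd.Series(["16.99", "10.34", "missing", "23.68"])

# Default (errors='raise'): the bad value stops the conversion
try:
    pd.to_numeric(raw)
except ValueError as exc:
    print("strict conversion failed:", exc)

# errors='coerce': unparseable values become NaN and the column becomes float64
cleaned = pd.to_numeric(raw, errors="coerce")
print(cleaned)
print("missing after coercion:", cleaned.isna().sum())
```

Comparing the non-null counts before and after coercion, as the `.info()` outputs above do, shows how many values could not be parsed.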

### String parsing with regular expressions

When working with data, it is sometimes necessary to write a regular expression to check that
values were entered properly. Phone numbers are a common field that needs to be checked for
validity. Your job in this exercise is to define a regular expression to match US phone
numbers that fit the pattern xxx-xxx-xxxx.

The regular expression module in Python is `re`. When performing pattern matching on data,
the same pattern is matched against many rows, so it's better to compile the pattern first
using `re.compile()` and then use the compiled pattern to match values.

In [7]:
# Import the regular expression module
import re

# Compile the pattern: prog (a raw string keeps the backslashes intact)
prog = re.compile(r'\d{3}-\d{3}-\d{4}')

# See if the pattern matches
result = prog.match('123-456-7890')
print(bool(result))

# See if the pattern matches
result = prog.match('1123-456-7890')
print(bool(result))

True
False
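
Note that `.match()` anchors only at the start of the string, so a pattern without an end anchor would also accept trailing characters. One way to require that the whole string fit the pattern is `.fullmatch()`. A small sketch, reusing one compiled pattern across several made-up values (`candidates` is hypothetical):

```python
import re

# Compile once, reuse for every value
phone_pattern = re.compile(r"\d{3}-\d{3}-\d{4}")

# Hypothetical values, as they might appear in a phone-number column
candidates = ["123-456-7890", "1123-456-7890", "123-456-78901", "555-0199"]

# fullmatch() succeeds only if the entire string matches the pattern
valid = [bool(phone_pattern.fullmatch(c)) for c in candidates]
print(valid)  # [True, False, False, False]
```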


### Extracting numerical values from strings
Extracting numbers from strings is a common task, particularly when working with unstructured data or log files.
Say you have the following string: 'the recipe calls for 6 strawberries and 2 bananas'.
It would be useful to extract the 6 and the 2 from this string to be saved for later use when comparing strawberry to banana ratios.
When using a regular expression to extract multiple numbers (or multiple pattern matches, to be exact), you can use the `re.findall()` function. It is straightforward to use: you pass a pattern and a string to `re.findall()`, and it returns a list of the matches.


In [8]:
# Find the numeric values: matches
matches = re.findall(r'\d+', 'the recipe calls for 10 strawberries and 1 banana')

# Print the matches
print(matches)

['10', '1']
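
The matches come back as strings, so they need to be converted before any arithmetic. A minimal sketch of the strawberry-to-banana ratio mentioned above, using the sentence from the text:

```python
import re

sentence = "the recipe calls for 6 strawberries and 2 bananas"

# findall() returns ['6', '2']; convert each match to an integer
strawberries, bananas = (int(n) for n in re.findall(r"\d+", sentence))

ratio = strawberries / bananas
print(ratio)  # 3.0
```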


In [9]:
# Write the first pattern: a phone number of the form xxx-xxx-xxxx
pattern1 = bool(re.match(pattern=r'\d{3}-\d{3}-\d{4}', string='123-456-7890'))
print(pattern1)

# Write the second pattern: a dollar amount with exactly two decimal places
pattern2 = bool(re.match(pattern=r'^\$\d*\.\d{2}$', string='$123.45'))
print(pattern2)

# Write the third pattern: a capital letter followed by any number of word characters
pattern3 = bool(re.match(pattern=r'[A-Z]\w*', string='Australia'))
print(pattern3)

True
True
True


### Custom functions to clean data

You'll now practice writing functions to clean data.

The tips_3 DataFrame has a 'sex' column that contains the values 'Male' or 'Female'. Your job is to write a function that will recode 'Male' to 1, 'Female' to 0, and return np.nan for all entries of 'sex' that are neither 'Male' nor 'Female'.

Recoding variables like this is a common data cleaning task. Functions provide a mechanism for you to abstract away complex bits of code as well as reuse code. This makes your code more readable and less error prone.

You can use the `.apply()` method to apply a function across entire rows or columns of a DataFrame. Each column of a DataFrame is a pandas Series, and functions can be applied across a Series as well. Here, you will apply your function over the 'sex' column.


In [27]:
# Read file into a DataFrame: tips_3
tips_3 = pd.read_csv("Data/tips_3.csv")

# Print the head of the DataFrame
tips_3.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,,2.0
1,10.34,1.66,Male,No,Sun,Dinner,3.0
2,21.01,3.5,Male,No,Sun,Dinner,3.0
3,23.68,3.31,Male,No,Sun,Dinner,2.0
4,,3.61,Female,No,Sun,Dinner,4.0


In [28]:
import numpy as np

def recode_sex(sex):
    if sex=="Male":
        return 1
    elif sex=="Female":
        return 0
    else:
        return np.nan
    
tips_3["sex_recode"] = tips_3["sex"].apply(recode_sex)
tips_3.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,sex_recode
0,16.99,1.01,Female,No,Sun,,2.0,0.0
1,10.34,1.66,Male,No,Sun,Dinner,3.0,1.0
2,21.01,3.5,Male,No,Sun,Dinner,3.0,1.0
3,23.68,3.31,Male,No,Sun,Dinner,2.0,1.0
4,,3.61,Female,No,Sun,Dinner,4.0,0.0
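
For a simple lookup like this one, a dictionary passed to `.map()` is a common alternative to a custom function. This is a different technique from the `.apply()` approach above, sketched on a made-up Series (`sex` is hypothetical): values without a key in the dictionary become NaN.

```python
import pandas as pd

# Hypothetical values, including a missing entry and a misspelled label
sex = pd.Series(["Female", "Male", None, "male"])

# Keys not found in the dictionary (None, 'male') are mapped to NaN
recoded = sex.map({"Male": 1, "Female": 0})
print(recoded)
```

A custom function remains the better fit when the recoding needs logic that a plain lookup cannot express.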


### Lambda functions

You will clean the 'total_dollar' column by removing the dollar sign. You'll do this using two different methods: with the `.replace()` method, and with regular expressions.

- Use the `.replace()` method inside a lambda function to remove the dollar sign from the 'total_dollar' column of tips_4.
    - You need to specify two arguments to the `.replace()` method: the string to be replaced ('$'), and the string to replace it with ('').
    - Apply the lambda function over the 'total_dollar' column of tips_4.
- Use a regular expression to remove the dollar sign from the 'total_dollar' column of tips_4.
    - The pattern has been provided for you: it is the first argument of the `re.findall()` function.
    - Complete the rest of the lambda function and apply it over the 'total_dollar' column of tips_4. ***Notice that because `re.findall()` returns a list, you have to index it in order to access the actual value.***

In [38]:
# Read file into a DataFrame: tips_4
tips_4 = pd.read_csv("Data/tips_4.csv")

print(tips_4.info())

# Keep only the first 8 columns, dropping the precomputed answer columns
tips_4 = tips_4.iloc[:, :8]

# Print the head of the DataFrame
tips_4.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 10 columns):
total_bill              244 non-null float64
tip                     244 non-null float64
sex                     244 non-null object
smoker                  244 non-null object
day                     244 non-null object
time                    244 non-null object
size                    244 non-null int64
total_dollar            244 non-null object
total_dollar_replace    244 non-null float64
total_dollar_re         244 non-null float64
dtypes: float64(4), int64(1), object(5)
memory usage: 19.2+ KB
None


Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,total_dollar
0,16.99,1.01,Female,No,Sun,Dinner,2,$16.99
1,10.34,1.66,Male,No,Sun,Dinner,3,$10.34
2,21.01,3.5,Male,No,Sun,Dinner,3,$21.01
3,23.68,3.31,Male,No,Sun,Dinner,2,$23.68
4,24.59,3.61,Female,No,Sun,Dinner,4,$24.59


In [39]:
# Write the lambda function using replace
tips_4["total_dollar_replace"] = tips_4["total_dollar"].apply(lambda x:x.replace("$", ""))
tips_4.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,total_dollar,total_dollar_replace
0,16.99,1.01,Female,No,Sun,Dinner,2,$16.99,16.99
1,10.34,1.66,Male,No,Sun,Dinner,3,$10.34,10.34
2,21.01,3.5,Male,No,Sun,Dinner,3,$21.01,21.01
3,23.68,3.31,Male,No,Sun,Dinner,2,$23.68,23.68
4,24.59,3.61,Female,No,Sun,Dinner,4,$24.59,24.59


In [43]:
# Write the lambda function using regular expressions (findall returns a list, so take element 0)
tips_4['total_dollar_re'] = tips_4.total_dollar.apply(lambda x: re.findall(r'\d+\.\d+', x)[0])
tips_4.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,total_dollar,total_dollar_replace,total_dollar_re
0,16.99,1.01,Female,No,Sun,Dinner,2,$16.99,16.99,16.99
1,10.34,1.66,Male,No,Sun,Dinner,3,$10.34,10.34,10.34
2,21.01,3.5,Male,No,Sun,Dinner,3,$21.01,21.01,21.01
3,23.68,3.31,Male,No,Sun,Dinner,2,$23.68,23.68,23.68
4,24.59,3.61,Female,No,Sun,Dinner,4,$24.59,24.59,24.59
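
Both lambda versions leave the cleaned values as strings. As a sketch of an alternative, pandas' vectorized `.str` methods can do the same cleaning without a lambda, and `.astype(float)` then makes the column numeric. The Series below is made up for illustration:

```python
import pandas as pd

# Hypothetical column of dollar amounts stored as strings
total_dollar = pd.Series(["$16.99", "$10.34", "$21.01"])

# regex=False treats '$' as a literal character, not as an end-of-string anchor
via_replace = total_dollar.str.replace("$", "", regex=False).astype(float)

# str.extract() pulls out the first capture group; expand=False returns a Series
via_extract = total_dollar.str.extract(r"(\d+\.\d+)", expand=False).astype(float)

print(via_replace.dtype, via_extract.dtype)
```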


### Dropping duplicate data

- Create a new DataFrame called tracks that contains the following columns from billboard: 'year', 'artist', 'track', and 'time'.
- Print the info of tracks. 
- Drop duplicate rows from tracks using the `.drop_duplicates()` method. Save the result to tracks_no_duplicates.
- Print the info of tracks_no_duplicates.

In [49]:
# Read file into a DataFrame: billboard
billboard = pd.read_csv("Data/billboard.csv")

# Print the head of the DataFrame
billboard.head()

Unnamed: 0,year,artist,track,time,date.entered,week,rank
0,2000,2Ge+her,The Hardest Part Of ...,3:15,2000-09-02,wk1,91.0
1,2000,2 Pac,Baby Don't Cry,4:22,2000-02-26,wk1,87.0
2,2000,3 Doors Down,Kryptonite,3:53,2000-04-08,wk1,81.0
3,2000,3 Doors Down,Loser,4:24,2000-10-21,wk1,76.0
4,2000,504 Boyz,Wobble Wobble,3:35,2000-04-15,wk1,57.0


In [50]:
# Create the new DataFrame: tracks
tracks = billboard[['year','artist','track','time']]

# Print info of tracks
print(tracks.info())

# Drop the duplicates: tracks_no_duplicates
tracks_no_duplicates = tracks.drop_duplicates()

# Print info of tracks_no_duplicates
print(tracks_no_duplicates.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24092 entries, 0 to 24091
Data columns (total 4 columns):
year      24092 non-null int64
artist    24092 non-null object
track     24092 non-null object
time      24092 non-null object
dtypes: int64(1), object(3)
memory usage: 753.0+ KB
None
<class 'pandas.core.frame.DataFrame'>
Int64Index: 317 entries, 0 to 316
Data columns (total 4 columns):
year      317 non-null int64
artist    317 non-null object
track     317 non-null object
time      317 non-null object
dtypes: int64(1), object(3)
memory usage: 12.4+ KB
None
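
By default `.drop_duplicates()` compares all columns, which is why the exercise first selects the columns that identify a track. A sketch of the same idea using the `subset` argument instead, on a small made-up DataFrame (the rows and the `week` values are hypothetical):

```python
import pandas as pd

# Hypothetical long-format data: one row per track per week
tracks = pd.DataFrame({
    "artist": ["2 Pac", "2 Pac", "3 Doors Down", "3 Doors Down"],
    "track": ["Baby Don't Cry", "Baby Don't Cry", "Kryptonite", "Loser"],
    "week": ["wk1", "wk2", "wk1", "wk1"],
})

# Comparing every column: no two rows are identical, so nothing is dropped
all_cols = tracks.drop_duplicates()

# Comparing only the identifying columns: the first row per (artist, track) is kept
by_track = tracks.drop_duplicates(subset=["artist", "track"]).reset_index(drop=True)

print(len(all_cols), len(by_track))  # 4 3
```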


### Filling missing data

It's rare to have a real-world dataset without any missing values. It's important to deal with them deliberately, because some calculations cannot handle missing values at all, while others will, by default, silently skip over them.
Understanding how much missing data you have, and thinking about where it comes from, is also crucial to making unbiased interpretations of data.

- Calculate the mean of the Ozone column of airquality using the `.mean()` method on airquality.Ozone.
- Use the `.fillna()` method to replace all the missing values in the Ozone column of airquality with the mean, oz_mean.

In [57]:
# Read file into a DataFrame: airquality
airquality = pd.read_csv("Data/airquality.csv")

print(airquality.info())

# Print the head of the DataFrame
airquality.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 153 entries, 0 to 152
Data columns (total 6 columns):
Ozone      116 non-null float64
Solar.R    146 non-null float64
Wind       153 non-null float64
Temp       153 non-null int64
Month      153 non-null int64
Day        153 non-null int64
dtypes: float64(3), int64(3)
memory usage: 7.3 KB
None


Unnamed: 0,Ozone,Solar.R,Wind,Temp,Month,Day
0,41.0,190.0,7.4,67,5,1
1,36.0,118.0,8.0,72,5,2
2,12.0,149.0,12.6,74,5,3
3,18.0,313.0,11.5,62,5,4
4,,,14.3,56,5,5


In [58]:
# Calculate the mean of the Ozone column: oz_mean
oz_mean = airquality.Ozone.mean()

# Replace all the missing values in the Ozone column with the mean
airquality['Ozone'] = airquality.Ozone.fillna(oz_mean)

# Print the info of airquality
print(airquality.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 153 entries, 0 to 152
Data columns (total 6 columns):
Ozone      153 non-null float64
Solar.R    146 non-null float64
Wind       153 non-null float64
Temp       153 non-null int64
Month      153 non-null int64
Day        153 non-null int64
dtypes: float64(3), int64(3)
memory usage: 7.3 KB
None
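
`.mean()` skips missing values by default, so the fill value is the mean of the observed entries only. A small sketch on a made-up Series (`ozone` is hypothetical) shows one side effect worth keeping in mind: mean imputation leaves the mean unchanged but shrinks the spread of the column.

```python
import numpy as np
import pandas as pd

# Hypothetical readings with two missing values
ozone = pd.Series([41.0, 36.0, np.nan, 18.0, np.nan])

# skipna=True is the default, so this is the mean of 41, 36 and 18
oz_mean = ozone.mean()
filled = ozone.fillna(oz_mean)

print("mean before/after:", oz_mean, filled.mean())
print("std before/after: ", ozone.std(), filled.std())
```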


### Testing your data with asserts
Here, you'll check for missing values and confirm that all values are non-negative.

The `.all()` method returns True if all values are True. When used on a DataFrame, it returns a Series of Booleans, one for each column in the DataFrame. So if you are using it on a DataFrame, like in this exercise, you need to chain another `.all()` method so that you get a single True or False value. When you use these within an assert statement, nothing is printed if the condition is true, and an AssertionError is raised if it is false: this is how you can confirm that the data you are checking are valid.
Note: You can use `pd.notnull(df)` as an alternative to `df.notnull()`.

- Write an assert statement to confirm that there are no missing values in ebola.
    - Use the `pd.notnull()` function on ebola (or the `.notnull()` method of ebola) and chain two `.all()` methods (that is, `.all().all()`). The first `.all()` method will return a True or False for each column, while the second `.all()` method will return a single True or False.
- Write an assert statement to confirm that all values in ebola are greater than or equal to 0.
    - Chain two `.all()` methods to the Boolean condition (ebola >= 0).

In [70]:
# Read file into a DataFrame: ebola
ebola = pd.read_csv("Data/ebola.csv", parse_dates=["Date"], index_col="Date")

print(ebola.info())

# Print the head of the DataFrame
ebola.head()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 122 entries, 2015-01-05 to 2014-03-22
Data columns (total 17 columns):
Day                    122 non-null int64
Cases_Guinea           122 non-null float64
Cases_Liberia          122 non-null float64
Cases_SierraLeone      122 non-null float64
Cases_Nigeria          122 non-null float64
Cases_Senegal          122 non-null float64
Cases_UnitedStates     122 non-null float64
Cases_Spain            122 non-null float64
Cases_Mali             122 non-null float64
Deaths_Guinea          122 non-null float64
Deaths_Liberia         122 non-null float64
Deaths_SierraLeone     122 non-null float64
Deaths_Nigeria         122 non-null float64
Deaths_Senegal         122 non-null float64
Deaths_UnitedStates    122 non-null float64
Deaths_Spain           122 non-null float64
Deaths_Mali            122 non-null float64
dtypes: float64(16), int64(1)
memory usage: 17.2 KB
None


Unnamed: 0_level_0,Day,Cases_Guinea,Cases_Liberia,Cases_SierraLeone,Cases_Nigeria,Cases_Senegal,Cases_UnitedStates,Cases_Spain,Cases_Mali,Deaths_Guinea,Deaths_Liberia,Deaths_SierraLeone,Deaths_Nigeria,Deaths_Senegal,Deaths_UnitedStates,Deaths_Spain,Deaths_Mali
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1
2015-01-05,289,2776.0,0.0,10030.0,0.0,0.0,0.0,0.0,0.0,1786.0,0.0,2977.0,0.0,0.0,0.0,0.0,0.0
2015-01-04,288,2775.0,0.0,9780.0,0.0,0.0,0.0,0.0,0.0,1781.0,0.0,2943.0,0.0,0.0,0.0,0.0,0.0
2015-01-03,287,2769.0,8166.0,9722.0,0.0,0.0,0.0,0.0,0.0,1767.0,3496.0,2915.0,0.0,0.0,0.0,0.0,0.0
2015-01-02,286,0.0,8157.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3496.0,0.0,0.0,0.0,0.0,0.0,0.0
2014-12-31,284,2730.0,8115.0,9633.0,0.0,0.0,0.0,0.0,0.0,1739.0,3471.0,2827.0,0.0,0.0,0.0,0.0,0.0


In [71]:
# Assert that there are no missing values
assert ebola.notnull().all().all()

# Assert that all values are >= 0
assert (ebola >= 0).all().all()


Since the assert statements did not raise any errors, we can be sure that there are no missing values in the DataFrame and that all values are >= 0.
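
To see what a failing check looks like, the sketch below runs the same pattern on a small made-up DataFrame (`counts` is hypothetical) that does contain a missing value. The optional message after the comma is shown when the assertion fails.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in the 'deaths' column
counts = pd.DataFrame({"cases": [3.0, 0.0, 7.0], "deaths": [1.0, np.nan, 2.0]})

# The first .all() gives one Boolean per column
per_column = counts.notnull().all()
print(per_column)

message = None
try:
    # The second .all() reduces the per-column results to a single Boolean
    assert counts.notnull().all().all(), "counts contains missing values"
except AssertionError as exc:
    message = str(exc)
    print("check failed:", message)

# After filling the gap, both checks pass silently
cleaned = counts.fillna(0)
assert cleaned.notnull().all().all()
assert (cleaned >= 0).all().all()
```

Keep in mind that Python skips assert statements when run with the `-O` flag, so they suit exploratory checks like these rather than validation that must always run.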