## Cleaning Data in Python



## Course Description

It's commonly said that data scientists spend 80% of their time cleaning and manipulating data and only 20% of their time analyzing it. The time spent cleaning is vital since analyzing dirty data can lead you to draw inaccurate conclusions. Data cleaning is an essential task in data science. Without properly cleaned data, the results of any data analysis or machine learning model could be inaccurate. In this course, you will learn how to identify, diagnose, and treat a variety of data cleaning problems in Python, ranging from simple to advanced. You will deal with improper data types, check that your data is in the correct range, handle missing data, perform record linkage, and more!

##  Common data problems

In this chapter, you'll learn how to overcome some of the most common dirty data problems. You'll convert data types, apply range constraints to remove future data points, and remove duplicated data points to avoid double-counting.

    Data type constraints    50 xp
    Common data types    100 xp
    Numeric data or ... ?    100 xp
    Summing strings and concatenating numbers    100 xp
    Data range constraints    50 xp
    Tire size constraints    100 xp
    Back to the future    100 xp
    Uniqueness constraints    50 xp
    How big is your subset?    50 xp
    Finding duplicates   100 xp
    Treating duplicates    100 xp


##  Text and categorical data problems

Categorical and text data can often be some of the messiest parts of a dataset due to their unstructured nature. In this chapter, you’ll learn how to fix whitespace and capitalization inconsistencies in category labels, collapse multiple categories into one, and reformat strings for consistency.

    Membership constraints    50 xp
    Members only    100 xp
    Finding consistency    100 xp
    Categorical variables    50 xp
    Categories of errors    100 xp
    Inconsistent categories    100 xp
    Remapping categories    100 xp
    Cleaning text data    50 xp
    Removing titles and taking names    100 xp
    Keeping it descriptive    100 xp


##  Advanced data problems

In this chapter, you’ll dive into more advanced data cleaning problems, such as ensuring that weights are all written in kilograms instead of pounds. You’ll also gain invaluable skills that will help you verify that values have been added correctly and that missing values don’t negatively impact your analyses.

    Uniformity    50 xp
    Ambiguous dates    50 xp
    Uniform currencies    100 xp
    Uniform dates    100 xp
    Cross field validation    50 xp
    Cross field or no cross field?    100 xp
    How's our data integrity?    100 xp
    Completeness    50 xp
    Is this missing at random?    50 xp
    Missing investors    100 xp
    Follow the money    100 xp


##  Record linkage

Record linkage is a powerful technique used to merge multiple datasets together, used when values have typos or different spellings. In this chapter, you'll learn how to link records by calculating the similarity between strings—you’ll then use your new skills to join two restaurant review datasets into one clean master dataset.

    Comparing strings    50 xp
    Minimum edit distance    50 xp
    The cutoff point    100 xp
    Remapping categories II    100 xp
    Generating pairs    50 xp
    To link or not to link?    100 xp
    Pairs of restaurants    100 xp
    Similar restaurants    100 xp
    Linking DataFrames    50 xp
    Getting the right index    50 xp
    Linking them together!    100 xp
    Congratulations!    50 xp


## Data type constraints







The instructor's name is Adel; he'll be our host as we learn how to clean data in Python.  

# *******************************************************************************************************************
# In this course, we're going to understand how to diagnose different problems in our data and how they can come up during our workflow.  We will also understand the side effects of not treating our data correctly, and various ways to address different types of dirty data.  


In this chapter, we're going to discuss the most common data problems you may encounter and how to address them.  

# To understand why we need to clean data, let's remind ourselves of the data science workflow.  In a typical data science workflow, we usually access our raw data, explore and process it, develop insights using visualizations or predictive models, and finally report these insights with dashboards or reports.  


Access Data --> Explore and Process Data --> Extract Insights --> Report Insights


# Dirty data can appear because of duplicate values, misspellings, data type parsing errors and legacy systems.  
Without making sure that data is properly cleaned in the exploration and processing phase, we will surely compromise the insights and reports subsequently generated.  As the old adage says, garbage in, garbage out.  


When working with data, there are various types that we may encounter along the way.  We could be working with text data, integers, decimals, dates, zip codes, and others.  Luckily, Python has specific data type objects for various data types that you're probably familiar with by now.  This makes it much easier to manipulate these various data types in Python.  
    
    Text data, Integers, Decimals, Binary, Dates, Categories
    str, int, float, bool, datetime, category


# *******************************************************************************************************************
# As such, before preparing to analyze and extract insights from our data, we need to make sure our variables have the correct data types, otherwise we risk compromising our analysis.  Let's take a look at the following example.  


Below is the head of a DF containing revenue generated and quantity of items sold for a sales order.  We want to calculate the total revenue generated by all sales orders.  As you can see, the Revenue column has the dollar sign on the right-hand side.  A close inspection of the DF columns' data types using the ".dtypes" attribute returns object for the Revenue column, which is what Pandas uses to store strings.  We can also check the data types as well as the number of missing values per column in a DataFrame, by using the ".info()" method.  

Since the Revenue column is a string, summing across all sales orders returns one large concatenated string containing each row's string.  To fix this, we need to first remove the $ sign from the string so that Pandas is able to convert the strings into numbers without error (how about defining the type in the first place? need to do some tests).  

# *******************************************************************************************************************
We do this with "sales['Revenue'].str.strip('$')", passing the string we want to strip as the argument, which in this case is the dollar sign.  
Since our dollar values do not contain decimals, we then convert the Revenue column to an integer by using the ".astype()" method, specifying the desired data type as the argument.  Had our revenue values been decimal, we would have converted the Revenue column to float.  
We can make sure that the Revenue column is now an integer by using the "assert" statement, which takes a condition as input, returns nothing if that condition is met, and raises an AssertionError if it is not.  
You can test almost anything you can imagine by using "assert", and we'll see more ways to utilize it as we go through the course.  
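
The strip, convert and assert steps above can be tried end to end with a small sketch. The two-row `sales` frame is a hypothetical sample that mirrors the table in these notes, and the Revenue column is forced to the object dtype so that the string behaviour is the same across pandas versions.

```python
import pandas as pd

# Hypothetical sample mirroring the sales table in these notes
sales = pd.DataFrame({
    "SalesOrderId": [43659, 43660],
    "Revenue": pd.Series(["23153$", "1457$"], dtype="object"),
    "Quantity": [12, 2],
})

# Summing a string column concatenates the rows instead of adding them
concatenated = sales["Revenue"].sum()

# Strip the trailing dollar sign, then convert to integers
sales["Revenue"] = sales["Revenue"].str.strip("$").astype("int64")

# assert is silent when the condition holds and raises AssertionError otherwise
assert pd.api.types.is_integer_dtype(sales["Revenue"])

total_revenue = sales["Revenue"].sum()
print(concatenated)   # 23153$1457$
print(total_revenue)  # 24610
```

Checking with `pd.api.types.is_integer_dtype` instead of comparing against the string 'int' is a choice made here so the check does not depend on the platform's default integer width.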
# *******************************************************************************************************************


# A common type of data seems numeric but actually represents categories with a finite set of possible values.  This is called categorical data.  
We will look more closely at categorical data in Chapter 2, but let's take a look at this example.  Here we have a marriage status column, which is represented by 0 for never married, 1 for married, 2 for separated, and 3 for divorced.  

However, it will be imported as type integer, which could lead to misleading results when trying to extract some statistical summaries.  We can solve this by using the same ".astype()" method seen earlier, but this time specifying the "category" data type.  When applying the ".describe()" method again, we see that the summary statistics are much more aligned with those of a categorical variable, showing the number of observations, number of unique values and most frequent category instead of mean and standard deviation.  
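
To see the difference in the summaries, here is a minimal sketch. The six marriage status values are made up for illustration.

```python
import pandas as pd

# Hypothetical survey sample: 0 = never married, 1 = married, 2 = separated, 3 = divorced
df = pd.DataFrame({"marriage_status": [0, 1, 1, 3, 2, 1]})

# As integers, describe() reports mean, std and quartiles, which mean little for codes
numeric_summary = df["marriage_status"].describe()

# As a category, describe() reports count, unique, top and freq instead
df["marriage_status"] = df["marriage_status"].astype("category")
categorical_summary = df["marriage_status"].describe()

print(numeric_summary.index.tolist())
print(categorical_summary.index.tolist())
```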


Now that we have a solid understanding of data type constraints, let's get to practice.  



# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Remove $ from Revenue column
sales['Revenue'] = sales['Revenue'].str.strip('$')
sales['Revenue'] = sales['Revenue'].astype('int')

# Verify that Revenue is now an integer
assert sales['Revenue'].dtypes == 'int'



# Convert to categorical
df['marriage_status'] = df['marriage_status'].astype('category')
df.describe()

In [None]:
sales = pd.read_csv('sales.csv')

print(sales.head())


---------------------------------------------
  SalesOrderId  | Revenue     | Quantity
0        43659  |     23153$  |       12
1        43660  |      1457$  |        2

In [12]:
price = '23153$'


n_price = price.strip('$')
print(n_price)
print(type(n_price))

print(int(n_price))
print(type(int(n_price)))


# A Python int has no .dtype, and a type never equals the string 'int';
# compare against the type itself instead
assert isinstance(int(n_price), int)

23153
<class 'str'>
23153
<class 'int'>

## Common data types

Manipulating and analyzing data with incorrect data types could lead to compromised analysis as you go along the data science workflow.

When working with new data, you should always check the data types of your columns using the .dtypes attribute or the .info() method which you'll see in the next exercise. Often times, you'll run into columns that should be converted to different data types before starting any analysis.

In this exercise, you'll first identify different types of data and correctly map them to their respective types.
Instructions
100XP

    Assign each card to what type of data you think it is.

Numeric data type: Salary earned monthly, 
                   Number of items bought in a basket, 
                   Number of points on customer loyalty card

Text:              City of residence, 
                   First name
                   Shipping address of a customer

Dates:             Birthdates of clients, 
                   Order date of a product, 


## Numeric data or ... ?

In this exercise, and throughout this chapter, you'll be working with bicycle ride sharing data in San Francisco called "ride_sharing". It contains information on the start and end stations, the trip duration, and some user information for a bike sharing service.

The "user_type" column contains information on whether a user is taking a free ride and takes on the following values:

    1 for free riders.
    2 for pay per ride.
    3 for monthly subscribers.

In this instance, you will print the information of ride_sharing using .info() and see a firsthand example of how an incorrect data type can flaw your analysis of the dataset. The pandas package is imported as pd.
Instructions 1/3
35 XP

    Question 1
    Print the information of ride_sharing.
#    Use .describe() to print the summary statistics of the user_type column from ride_sharing.
    
    
    
    Question 2
#    By looking at the summary statistics - they don't really seem to offer much description on how users are distributed along their purchase type, why do you think that is?
Possible Answers
    -The user_type column is not of the correct type, it should be converted to str.
    -The user_type column has an infinite set of possible values, it should be converted to category.
#    -The user_type column has a finite set of possible values that represent groupings of data, it should be converted to category.     Also other 'int64' values, station_A_id, bike_id, user_birth_year, user_gender
# *******************************************************************************************************************
# *******************************************************************************************************************



    Question 3
#    -Convert "user_type" into categorical by assigning it the 'category' data type and store it in the "user_type_cat" column.
#    -Make sure you converted "user_type_cat" correctly by using an "assert" statement.


In [44]:
import pandas as pd


df = pd.read_csv('ride_sharing_new.csv', index_col=0)
print(df.dtypes)

print(df.head(3))



df['user_type'] = df['user_type'].astype('category')

#print(dir(df['user_type']))

print(df['user_type'].dtypes)
assert df['user_type'].dtype == 'category' # ------------------------------------------------------------------------


print('user_type describe')
print(df['user_type'].describe())


print(len(df[df['user_type']==2]))


# *******************************************************************************************************************
# filt_list = movies_genres.loc[movies_genres['_merge']=='both', 'title'].unique()
print(df.loc[df['user_type']==2, 'user_type'])

# *******************************************************************************************************************
print(df[df['user_type']==2]['user_type'].copy())




print(df.info())


print(df.describe())

duration           object
station_A_id        int64
station_A_name     object
station_B_id        int64
station_B_name     object
bike_id             int64
user_type           int64
user_birth_year     int64
user_gender        object
dtype: object
     duration  station_A_id  \
0  12 minutes            81   
1  24 minutes             3   
2   8 minutes            67   

                                      station_A_name  station_B_id  \
0                                 Berry St at 4th St           323   
1       Powell St BART Station (Market St at 4th St)           118   
2  San Francisco Caltrain Station 2  (Townsend St...            23   

                    station_B_name  bike_id  user_type  user_birth_year  \
0               Broadway at Kearny     5480          2             1959   
1  Eureka Valley Recreation Center     5193          2             1965   
2    The Embarcadero at Steuart St     3652          3             1993   

  user_gender  
0        Male  
1        Male

## Summing strings and concatenating numbers

In the previous exercise, you were able to identify that category is the correct data type for user_type and convert it 
# in order to extract relevant statistical summaries that shed light on the distribution of user_type.

# Another common data type problem is importing what should be numerical values as strings, 
as mathematical operations such as summing and multiplication lead to string concatenation, not numerical outputs.

In this exercise, you'll be converting the string column duration to the type int. Before that however, you will need to make sure to strip "minutes" from the column in order to make sure pandas reads it as numerical. The pandas package has been imported as pd.
Instructions
100 XP

#    Use the ".strip()" method to strip duration of "minutes" and store it in the "duration_trim" column.
    Convert "duration_trim" to int and store it in the "duration_time" column.
    Write an "assert" statement that checks if duration_time's data type is now an int.
    Print the average ride duration.


In [59]:

df = pd.read_csv('ride_sharing_new.csv', index_col=0)
print(df.dtypes)

df['user_type'] = df['user_type'].astype('category')


print(df['duration'].head())


df['duration'] = df['duration'].str.strip('minutes')

df['duration'] = df['duration'].astype('int')
print(df['duration'].head())

print('\n')
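print(df['duration'].mean)  # missing parentheses: this prints the bound method shown in the output; .mean() returns the average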

print(df.info())

duration           object
station_A_id        int64
station_A_name     object
station_B_id        int64
station_B_name     object
bike_id             int64
user_type           int64
user_birth_year     int64
user_gender        object
dtype: object
0    12 minutes
1    24 minutes
2     8 minutes
3     4 minutes
4    11 minutes
Name: duration, dtype: object
0    12
1    24
2     8
3     4
4    11
Name: duration, dtype: int64


<bound method NDFrame._add_numeric_operations.<locals>.mean of 0        12
1        24
2         8
3         4
4        11
         ..
25755    11
25756    10
25757    14
25758    14
25759    29
Name: duration, Length: 25760, dtype: int64>
<class 'pandas.core.frame.DataFrame'>
Int64Index: 25760 entries, 0 to 25759
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype   
---  ------           --------------  -----   
 0   duration         25760 non-null  int64   
 1   station_A_id     25760 non-null  int64   
 2   station_A_name   25760 non-nul

In [None]:
# Strip duration of minutes
ride_sharing['duration_trim'] = ride_sharing['duration'].____.____()

# Convert duration to integer
ride_sharing['duration_time'] = ____

# Write an assert statement making sure of conversion
assert ride_sharing['____'].____ == '____'

# Print formed columns and calculate average ride duration 
print(ride_sharing[['duration','duration_trim','duration_time']])
print(____)
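
One detail of the stripping step is worth a small sketch: ".str.strip('minutes')" removes any of the characters m, i, n, u, t, e, s from both ends rather than the word itself, so the space before the word is left behind. The three durations below are hypothetical values in the same format as the ride_sharing column.

```python
import pandas as pd

# Hypothetical durations in the same "<number> minutes" format as ride_sharing
duration = pd.Series(["12 minutes", "24 minutes", "8 minutes"], dtype="object")

# Strips the characters m, i, n, u, t, e, s from both ends; the space remains
trimmed = duration.str.strip("minutes")

# Stripping the remaining whitespace makes the integer conversion explicit
duration_time = trimmed.str.strip().astype("int64")

# Verify the conversion, then compute the average ride duration
assert duration_time.dtype == "int64"
average = duration_time.mean()
print(trimmed.tolist())
print(average)
```

Here ".astype('int')" would also accept the trailing space, as the notebook output above shows; the extra ".str.strip()" is only there to make the intermediate values clean.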

## Data range constraints





# Hi and welcome back.  In this lesson, we're going to discuss data that should fall within a range.  Let's first start off with some motivation.  

Imagine we have a dataset of movies with their respective average rating from a streaming service.  The rating can be any integer between 1 and 5.  After creating a histogram with Matplotlib, we see that there are a few movies with an average rating of 6, which is well above the allowable range. 
# This is most likely an error in data collection or parsing, where a variable is well beyond its range and treating it is essential to have accurate analysis.  

Here is another example, where we see subscription dates in the future for a service.  
# Inherently this doesn't make any sense, as we cannot sign up for a service in the future, but these errors exist either due to technical or human error.  
We use the datetime package's ".date.today()" function to get today's date, and we filter the dataset by any subscription date later than today's date.  



# *******************************************************************************************************************
# We need to pay attention to the range of our data. 


# *******************************************************************************************************************
# There's a variety of options to deal with out of range data.  

(1) The simplest option is to drop the data.  
However, depending on the size of your out of range data, you could be losing essential information.  As a rule of thumb, only drop data when a small proportion of your dataset is affected by out of range values; even then, you really need to understand your dataset before deciding to drop values.  

(2) Another option would be setting custom minimums or maximums to your columns.  

(3) We could also set the data to be missing, and impute it, but we'll take a look at how to deal with missing data in Chapter 3.  

(4) We could also, dependent on the business assumptions behind our data, assign a custom value for any values of our data that go beyond a certain range.  
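
The four options above can be sketched side by side. The `movies` frame is a hypothetical sample, and the median used for option (4) is an arbitrary stand-in for whatever value the business assumptions would dictate.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings; 6.0 falls outside the allowed 1-5 range
movies = pd.DataFrame({
    "movie_name": ["A", "B", "C", "D"],
    "avg_rating": [4.5, 6.0, 3.0, 5.0],
})
out_of_range = movies["avg_rating"] > 5

# (1) Drop the out-of-range rows
dropped = movies[~out_of_range]

# (2) Cap values at a custom maximum
capped = movies.assign(avg_rating=movies["avg_rating"].clip(upper=5))

# (3) Treat the values as missing so they can be imputed later
as_missing = movies.assign(avg_rating=movies["avg_rating"].mask(out_of_range, np.nan))

# (4) Assign a custom value (here the median of the valid ratings, an arbitrary choice)
custom = movies.copy()
custom.loc[out_of_range, "avg_rating"] = movies.loc[~out_of_range, "avg_rating"].median()

print(dropped, capped, as_missing, custom, sep="\n\n")
```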


# *******************************************************************************************************************
Let's take a look at the movies example mentioned earlier.  We first isolate the movies with ratings higher than 5.  If these values affect only a small part of our data, we can drop them.  We can drop them in 2 ways: we can either create a new filtered movies DF where we only keep values of "avg_rating" lower than or equal to 5, or drop the values by using the ".drop()" method.  The ".drop()" method takes as argument the row indices of movies for which the "avg_rating" is higher than 5.  We set the "inplace=" argument to True so that values are dropped in place and we don't have to create a new DataFrame.  We can make sure the drop took effect using an assert statement that checks that the maximum of avg_rating is lower than or equal to 5.  

Depending on the assumptions behind our data, we can also change the out of range values to a hard limit.  For example, here we're setting any value of the avg_rating column to 5 if it goes beyond it.  We can do this using the ".loc[]" indexer, which selects all cells that fit a custom row and column index.  It takes as first argument the row index, here all instances of avg_rating above 5, and as second argument the column index, which is here the avg_rating column.  Again, we can make sure that this change was done using an assert statement.  


# *******************************************************************************************************************
Let's take another look at the date range example mentioned earlier, where we have subscriptions happening in the future.  We first look at the data type of the column with the ".dtype" attribute.  We can confirm that the subscription_date column is an "object" and not a "datetime" object.  Datetime objects allow much easier manipulation of date data, so let's convert it to that.  We do so with the "pd.to_datetime()" function from Pandas, which takes as argument the column we want to convert.  We can then test the data type conversion by asserting that the subscription date column's data type is equal to "datetime64[ns]", which is how the data type is represented in Pandas.  

Now that the column is in datetime, we can treat it in a variety of ways.  We first create a today_date variable using the datetime function ".date.today()", which allows us to store today's date.  We can then either drop the rows with exceeding dates, similar to how we did in the average rating example, or replace exceeding values with today's date.  



import matplotlib.pyplot as plt

plt.hist(movies['avg_rating'])
plt.title('Average rating of movies (1-5)')



# Import date time
import datetime

today_date = datetime.date.today()
user_signups[user_signups['subscription_date'] > today_date]



import pandas as pd
# Output Movies with range > 5
movies[movies['avg_rating'] > 5]



# Drop values using filtering
movies = movies[movies['avg_rating'] <= 5]

# Drop values using .drop()
movies.drop(movies[movies['avg_rating'] > 5].index, inplace = True)  #***********************************************

# Verify that no rating above 5 remains
assert movies['avg_rating'].max() <= 5



# Convert avg_rating > 5 to value 5
movies.loc[movies['avg_rating']>5, 'avg_rating'] = 5



# Convert to datetime
user_signups['subscription_date'] = pd.to_datetime(user_signups['subscription_date'])

# Assert that conversion happened
assert user_signups['subscription_date'].dtype == 'datetime64[ns]'  #################################################


# <1> Drop the data
today_date = datetime.date.today()

# Drop values using filtering (keep only the rows that are not in the future)
user_signups = user_signups[user_signups['subscription_date'] < today_date]
# Drop values using .drop() (remove the rows that are in the future)
user_signups.drop(user_signups[user_signups['subscription_date'] > today_date].index, inplace = True)  ##############
# *******************************************************************************************************************

# <2> Hardcode dates with upper limit

# Replace values using filtering
user_signups.loc[user_signups['subscription_date'] > today_date, 'subscription_date'] = today_date
# Assert is true
assert user_signups['subscription_date'].max().date() <= today_date
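
A self-contained version of the date treatment could look like the sketch below. It assumes a hypothetical three-row `user_signups` frame, and it swaps "datetime.date.today()" for a pandas Timestamp, because comparing a datetime64 column directly with a "datetime.date" object can raise a TypeError in recent pandas versions.

```python
import pandas as pd

# Hypothetical sign-ups; the 2200 date stands in for an erroneous future value
user_signups = pd.DataFrame({
    "subscription_date": ["2019-05-01", "2200-01-01", "2020-11-15"],
})

# Parse the strings into datetime64 values and verify the conversion
user_signups["subscription_date"] = pd.to_datetime(user_signups["subscription_date"])
assert pd.api.types.is_datetime64_any_dtype(user_signups["subscription_date"])

# A Timestamp at midnight today compares cleanly with a datetime64 column
today = pd.Timestamp.today().normalize()

# Option 1: keep only the rows that are not in the future
filtered = user_signups[user_signups["subscription_date"] <= today]

# Option 2: cap future dates at today
capped = user_signups.copy()
capped.loc[capped["subscription_date"] > today, "subscription_date"] = today
assert capped["subscription_date"].max() <= today

print(filtered)
print(capped)
```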

In [76]:
url = 'http://www.omdbapi.com/?t=hackers'
url = 'http://www.omdbapi.com/?i=tt3896198&apikey=8c241015'
url = 'http://www.omdbapi.com/?s=Batman&page=1&apikey=8c241015'
url = 'http://www.omdbapi.com/?s=Batman&page=2&apikey=8c241015'




import json
import requests

r = requests.get(url)
j_data = r.json()

print(j_data)



titles = []
with requests.get(url) as r:
    j_data = r.json()
    content = j_data['Search']
    for i in content:
        titles.append(i['Title'])
    

print(titles)

{'Search': [{'Title': 'Batman: The Killing Joke', 'Year': '2016', 'imdbID': 'tt4853102', 'Type': 'movie', 'Poster': 'https://m.media-amazon.com/images/M/MV5BMTdjZTliODYtNWExMi00NjQ1LWIzN2MtN2Q5NTg5NTk3NzliL2ltYWdlXkEyXkFqcGdeQXVyNTAyODkwOQ@@._V1_SX300.jpg'}, {'Title': 'Batman: Mask of the Phantasm', 'Year': '1993', 'imdbID': 'tt0106364', 'Type': 'movie', 'Poster': 'https://m.media-amazon.com/images/M/MV5BYTRiMWM3MGItNjAxZC00M2E3LThhODgtM2QwOGNmZGU4OWZhXkEyXkFqcGdeQXVyNjExODE1MDc@._V1_SX300.jpg'}, {'Title': 'Batman: The Dark Knight Returns, Part 2', 'Year': '2013', 'imdbID': 'tt2166834', 'Type': 'movie', 'Poster': 'https://m.media-amazon.com/images/M/MV5BYTEzMmE0ZDYtYWNmYi00ZWM4LWJjOTUtYTE0ZmQyYWM3ZjA0XkEyXkFqcGdeQXVyNTA4NzY1MzY@._V1_SX300.jpg'}, {'Title': 'Batman: Year One', 'Year': '2011', 'imdbID': 'tt1672723', 'Type': 'movie', 'Poster': 'https://m.media-amazon.com/images/M/MV5BNTJjMmVkZjctNjNjMS00ZmI2LTlmYWEtOWNiYmQxYjY0YWVhXkEyXkFqcGdeQXVyNTAyODkwOQ@@._V1_SX300.jpg'}, {'Title': 'Ba

In [80]:

df = pd.read_csv('ride_sharing_new.csv', index_col=0)
print(df.head())


df['user_type'] = df['user_type'].astype('category')

print(df.info())


df.describe()

#help(df.drop)

     duration  station_A_id  \
0  12 minutes            81   
1  24 minutes             3   
2   8 minutes            67   
3   4 minutes            16   
4  11 minutes            22   

                                      station_A_name  station_B_id  \
0                                 Berry St at 4th St           323   
1       Powell St BART Station (Market St at 4th St)           118   
2  San Francisco Caltrain Station 2  (Townsend St...            23   
3                            Steuart St at Market St            28   
4                              Howard St at Beale St           350   

                    station_B_name  bike_id  user_type  user_birth_year  \
0               Broadway at Kearny     5480          2             1959   
1  Eureka Valley Recreation Center     5193          2             1965   
2    The Embarcadero at Steuart St     3652          3             1993   
3     The Embarcadero at Bryant St     1883          1             1979   
4             8th

       station_A_id  station_B_id      bike_id  user_birth_year
count       25760.0       25760.0      25760.0          25760.0
mean      31.023602     89.558579  4107.621467      1983.054969
std       26.409263    105.144103  1576.315767        10.010992
min             3.0           3.0         11.0           1901.0
25%            15.0          21.0       3106.0           1978.0
50%            21.0          58.0       4821.0           1985.0
75%            67.0          93.0       5257.0           1990.0
max            81.0         383.0       6638.0           2001.0


## Tire size constraints

In this lesson, you're going to build on top of the work you've been doing with the ride_sharing DataFrame. You'll be working with the "tire_sizes" column which contains data on each bike's tire size.

Bicycle tire sizes could be either 26″, 27″ or 29″ and are here correctly stored as a categorical value. In an effort to cut maintenance costs, the ride sharing provider decided to set the maximum tire size to be 27″.

# In this exercise, you will make sure the tire_sizes column has the correct range by first converting it to an integer, then setting and testing the new upper limit of 27″ for tire sizes.
Instructions
100 XP

    Convert the "tire_sizes" column from category to 'int'.
    Use ".loc[]" method to set all values of tire_sizes above 27 to 27.
    Reconvert back tire_sizes to 'category' from int.
    Print the description of the tire_sizes.


In [3]:
import pandas as pd

ride_sharing = pd.read_csv('ride_sharing_new.csv')

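ride_sharing['duration'] = ride_sharing['duration'].str.strip('minutes')
# Assign the result back; astype() returns a new Series and does not modify the column
ride_sharing['duration'] = ride_sharing['duration'].astype('int')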

ride_sharing['user_type'] = ride_sharing['user_type'].astype('category')

print(ride_sharing.head())



# Convert tire_sizes to integer
ride_sharing['tire_sizes'] = ride_sharing['tire_sizes'].astype('int')

# Set all values above 27 to 27
ride_sharing.loc[ride_sharing['tire_sizes'] > 27, 'tire_sizes'] = 27  # ***********************************************

# Reconvert tire_sizes back to categorical
ride_sharing['tire_sizes'] = ride_sharing['tire_sizes'].astype('category')  # ***************************************

# Print tire size description
print(ride_sharing['tire_sizes'].describe())

   Unnamed: 0 duration  station_A_id  \
0           0      12             81   
1           1      24              3   
2           2       8             67   
3           3       4             16   
4           4      11             22   

                                      station_A_name  station_B_id  \
0                                 Berry St at 4th St           323   
1       Powell St BART Station (Market St at 4th St)           118   
2  San Francisco Caltrain Station 2  (Townsend St...            23   
3                            Steuart St at Market St            28   
4                              Howard St at Beale St           350   

                    station_B_name  bike_id user_type  user_birth_year  \
0               Broadway at Kearny     5480         2             1959   
1  Eureka Valley Recreation Center     5193         2             1965   
2    The Embarcadero at Steuart St     3652         3             1993   
3     The Embarcadero at Bryant St     188

NameError: name '____' is not defined
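
For reference, the tire size steps can be run on a small stand-in frame. The five values below are hypothetical, and the sketch assumes the categories are stored as integers rather than strings.

```python
import pandas as pd

# Hypothetical stand-in for the tire_sizes column, stored as a category
ride_sharing = pd.DataFrame({
    "tire_sizes": pd.Series([26, 27, 29, 29, 26], dtype="category"),
})

# Convert to integer so the values can be compared numerically
ride_sharing["tire_sizes"] = ride_sharing["tire_sizes"].astype("int64")

# Set every value above 27 to the new upper limit
ride_sharing.loc[ride_sharing["tire_sizes"] > 27, "tire_sizes"] = 27

# Convert back to category and print the description
ride_sharing["tire_sizes"] = ride_sharing["tire_sizes"].astype("category")
print(ride_sharing["tire_sizes"].describe())
```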

## Back to the future

# *******************************************************************************************************************
A new update to the data pipeline feeding into the "ride_sharing" DataFrame now registers each ride's date. This information is stored in the "ride_date" column of the type "object", which represents strings in pandas.

# A bug was discovered which was relaying rides taken today as taken next year (think other conditions, last year?).
To fix this, you will find all instances of the "ride_date" column that occur anytime in the future, and set the maximum possible value of this column to today's date. Before doing so, you would need to convert "ride_date" to a datetime object.
# *******************************************************************************************************************

The datetime package has been imported as dt, alongside all the packages you've been using till now.
Instructions
100 XP

    Convert ride_date to a datetime object and store it in ride_dt column using to_datetime().
    Create the variable today, which stores today's date by using the "dt.date.today()" function.
#    For all instances of "ride_dt" in the future, set them to today's date.
    Print the maximum date in the ride_dt column.


In [None]:
import datetime
import pandas as pd


# Use a Timestamp at midnight today so it compares cleanly with a datetime64 column
today_date = pd.Timestamp(datetime.date.today())

ride_sharing['ride_date'] = ride_sharing['ride_date'].astype('datetime64[ns]')


# Assign through .loc only; chaining "column = .loc[...] = value" overwrites the whole column
ride_sharing.loc[ride_sharing['ride_date'] > today_date, 'ride_date'] = today_date


# Your thinking makes you smarter, not staring at the answer.  You learn nothing by looking at it, but a lot by figuring it out.  Choose the right way.

In [None]:
# Convert ride_date to datetime
ride_sharing['ride_dt'] = pd.____(____['____'])

# Save today's date
today = ____

# Set all in the future to today's date
ride_sharing.____[____['____'] > ____, '____'] = ____

# Print maximum of ride_dt column
print(ride_sharing['ride_dt'].____())
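
One way the four steps could fit together is sketched below on a hypothetical three-row frame. It is not presented as the exercise's official solution. The ".dt.date" step is an assumption made here so that the column holds plain date objects that compare directly with "dt.date.today()".

```python
import datetime as dt
import pandas as pd

# Hypothetical ride dates stored as strings; 2200-01-01 plays the buggy future value
ride_sharing = pd.DataFrame({"ride_date": ["2020-03-04", "2200-01-01", "2019-07-21"]})

# Parse the strings, then keep only the calendar date (an object column of date values)
ride_sharing["ride_dt"] = pd.to_datetime(ride_sharing["ride_date"]).dt.date

# Save today's date
today = dt.date.today()

# Set all rides in the future to today's date
ride_sharing.loc[ride_sharing["ride_dt"] > today, "ride_dt"] = today

# Print the maximum of the ride_dt column
print(ride_sharing["ride_dt"].max())
```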

## Uniqueness constraints






Hi and welcome to the final lesson of this chapter.  Let's discuss another common data cleaning problem, duplicate values.  Duplicate values can be diagnosed when we have the exact same information repeated across multiple rows, for some or all columns of our DF.  

In this example DF containing the names, address, height, and weight of individuals, the rows represented have identical values across all columns.  In this one there are duplicate values for all columns except the height column, which leads us to think it's more likely a data entry error than an actual other person.  

# *******************************************************************************************************************
Apart from the data entry and human errors alluded to in the previous slide, duplicate data can also arise because of bugs and design errors, whether in business processes or data pipelines.  However, they most often arise from the necessary act of joining and consolidating data from various sources, which can retain duplicate values.  


# Let's first see how to find duplicate values.  "df.duplicated()" method --> boolean indexing
In this example we're working with a bigger version of the height and weight data seen earlier in this video.  We can find duplicates in a DF by using the "df.duplicated()" method.  It returns a Series of boolean values that are True for duplicated rows and False for non-duplicated rows.  
We can see exactly which rows are affected by using boolean indexing (df[df.duplicated()]).  However, using "df.duplicated()" without adjusting its arguments can lead to misleading results: by default, a row counts as a duplicate only if it matches another row in every column, and every duplicate is marked True except for the first occurrence.  
# Here, can I recall how we do the DF merge and concatenate operations?  Think and think
This limits our ability to properly diagnose what type of duplication we have and how to effectively treat it.  To properly calibrate how we go about finding duplicates, we will use 2 arguments of the "df.duplicated()" method.  The "subset=" argument lets us set a list of column names to check for duplication.  For example, it allows us to find duplicates for the first and last name columns only.  The "keep=" argument controls which occurrences are marked as duplicates: "first" (the default) marks all but the first occurrence, "last" marks all but the last occurrence, and False marks every occurrence.  In the example below, we're checking for duplicates across the first name, last name and address columns,
and we're choosing to mark all duplicates with keep=False.  
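
To make the effect of "subset=" concrete, here is a minimal runnable sketch on a made-up four-row `height_weight` frame (the rows are hypothetical, loosely modelled on the slide data). It contrasts the default check across all columns with a check restricted to the identifying columns.

```python
import pandas as pd

# Hypothetical sample: rows 0 and 2 share name and address but differ in height
height_weight = pd.DataFrame({
    "first_name": ["Ivor", "Mary", "Ivor", "Cole"],
    "last_name": ["Pierce", "Colon", "Pierce", "Palmer"],
    "address": ["102-3364 Non. Road", "4674 Ut Rd.", "102-3364 Non. Road", "8366 At, Street"],
    "height": [168, 179, 169, 178],
})

# Default: every column must match, so the near-duplicate pair is not flagged
n_complete = height_weight.duplicated().sum()
print(n_complete)

# Restrict the comparison to the identifying columns and flag every occurrence
column_names = ["first_name", "last_name", "address"]
duplicates = height_weight.duplicated(subset=column_names, keep=False)
print(height_weight[duplicates])
```

With the default arguments nothing is flagged here, which is exactly why incomplete duplicates are easy to miss.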


# *******************************************************************************************************************
We see the following results.  To get a better bird's-eye view of the duplicates, we sort the duplicated rows using the "df.sort_values(by='first_name')" method, choosing "first_name" to sort by.  We find that there are 4 sets of duplicated rows: 2 of them are complete duplicates of each other across all columns, while the other 2 are incomplete duplicates of each other, with discrepancies in height and weight respectively.  

# ".drop_duplicates()" method
The complete duplicates can be treated easily.  All that is required is to keep one of them and discard the others.  This can be done with the ".drop_duplicates()" method, which takes the same "subset=" and "keep=" arguments as the ".duplicated()" method, as well as the "inplace=" argument, which drops the duplicated rows directly inside the height_weight DF.  Here we are dropping complete duplicates only, so it's neither necessary nor advisable to set "subset=", and since "keep=" defaults to "first", we can leave it as is.  Note that we could also set it to "last", but not to False, because in ".drop_duplicates()" keep=False drops every copy of a duplicated row, including the one we want to retain.  
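
A small sketch of how the three "keep=" values behave in ".drop_duplicates()", using a made-up three-row frame (hypothetical data). The thing to watch is that keep=False removes every copy of a duplicated row rather than retaining one.

```python
import pandas as pd

# Hypothetical sample: rows 0 and 1 are complete duplicates of each other
df = pd.DataFrame({"name": ["Cole", "Cole", "Mary"], "height": [178, 178, 179]})

kept_first = df.drop_duplicates()             # keep="first" is the default
kept_last = df.drop_duplicates(keep="last")   # retain the last copy instead
kept_none = df.drop_duplicates(keep=False)    # drop every copy of a duplicated row

print(kept_first.index.tolist())  # [0, 2]
print(kept_last.index.tolist())   # [1, 2]
print(kept_none.index.tolist())   # [2]
```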

This leaves us with the other 2 sets of duplicates discussed earlier, which are the same for first_name, last_name and address, but contain discrepancies in height and weight.  

# Apart from dropping rows with really small discrepancies, we can use a statistical measure to combine each set of duplicated values.  
For example, we can combine these 2 rows into 1 by computing the mean between them, or the maximum, or another statistical measure.  Which one to use depends heavily on a common-sense understanding of our data and what type of data we have.  We can do this easily using the ".groupby()" method, which, when chained with the ".agg()" method, lets you group by a set of common columns and return statistical values for specific columns when the aggregation is performed. 

For example, here we created a dictionary called summaries, which instructs groupby to return the maximum of duplicated rows for the height column, and the mean of duplicated rows for the weight column.  
We then ".groupby()" height_weight by the column names defined earlier, and chain it with the ".agg()" method, which takes in the summaries dictionary we created.  
We chain this entire line with the ".reset_index()" method, so that we have numbered indices in the final output. 
We can verify that there are no more duplicate values by running the ".duplicated()" method again, and using brackets to output the duplicate rows.  
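
The aggregation described above can be sketched end to end on a made-up frame (hypothetical rows with shortened addresses, echoing the slide's names). Note that ".groupby()" collapses every group to a single row, so complete duplicates would be merged as well.

```python
import pandas as pd

# Hypothetical sample: two pairs of incomplete duplicates plus one unique row
height_weight = pd.DataFrame({
    "first_name": ["Desirae", "Desirae", "Ivor", "Ivor", "Mary"],
    "last_name": ["Shannon", "Shannon", "Pierce", "Pierce", "Colon"],
    "address": ["P.O. Box 643", "P.O. Box 643", "102-3364 Non. Road",
                "102-3364 Non. Road", "4674 Ut Rd."],
    "height": [195, 196, 168, 168, 179],
    "weight": [83, 83, 66, 88, 75],
})

column_names = ["first_name", "last_name", "address"]
summaries = {"height": "max", "weight": "mean"}

# One row per person: tallest recorded height, average recorded weight
height_weight = height_weight.groupby(by=column_names).agg(summaries).reset_index()
print(height_weight)

# Verify that no duplicates remain on the identifying columns
duplicates = height_weight.duplicated(subset=column_names, keep=False)
print(height_weight[duplicates])
```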


Now that we have a solid grasp of duplication, let's practice.  


---------------------------------------------------------------------------------------------
first_name   | last_name   | address                                    | height   | weight
Justin       | Saddlemyer  | Boulevard du Jardin Botainque 3, Bruxelles | 193 cm   | 87 kg
Justin       | Saddlemyer  | Boulevard du Jardin Botainque 3, Bruxelles | 194 cm   | 87 kg




Data Entry & Human Error

Bugs and design errors

Join or merge Errors



# Column names to check for duplication
column_names = ['first_name', 'last_name', 'address']

duplicates = height_weight.duplicated(subset=column_names, keep=False)  #############################################

print(height_weight[duplicates])

----------------------------------------------------- Imagine what the output will be




# Output duplicate values
height_weight[duplicates].sort_values(by='first_name')   # Think: why do first_name, last_name and address identify the duplicate sets?

----------------------------------------------------------------------------------------
     first_name  | last_name  | address                              | height  | weight
 22        Cole  |    Palmer  |                     8366 At, Street  |    178  | 91
102        Cole  |    Palmer  |                     8366 At, Street  |    178  | 91
 28     Desirae  |   Shannon  | P.O. Box 643, 5251 Consectetuer, Rd. |    195  | 83
103     Desirae  |   Shannon  | P.O. Box 643, 5251 Consectetuer, Rd. |    196  | 83
  1        Ivor  |    Pierce  |                   102-3364 Non. Road |    168  | 66
101        Ivor  |    Pierce  |                   102-3364 Non. Road |    168  | 88
 37        Mary  |     Colon  |                         4674 Ut Rd.  |    179  | 75
100        Mary  |     Colon  |                         4674 Ut Rd.  |    179  | 75
         
         
# Drop duplicates
height_weight.drop_duplicates(inplace=True)  ########################################################################


height_weight.groupby(['first_name', 'last_name', 'address']).agg('mean')  # ?? would a plain mean over every column also work?

# Groupby column names and produce statistical summaries
column_names = ['first_name', 'last_name', 'address']
summaries = {'height': 'max', 'weight': 'mean'}
height_weight = height_weight.groupby(by=column_names).agg(summaries).reset_index()  ################################

# Make sure aggregation is done
duplicates = height_weight.duplicated(subset=column_names, keep=False)
height_weight[duplicates].sort_values(by='first_name')

## How big is your subset?

You have the following loans DataFrame which contains loan and credit score data for consumers, and some metadata such as their first and last names. You want to find both complete and incomplete duplicates using ".duplicated()" method.
first_name 	last_name 	    credit_score 	has_loan
Justin 	    Saddlemeyer 	600 	        1
Hadrien 	Lacroix 	    450 	        0

Choose the correct usage of ".duplicated()" below:
Answer the question
50XP
Possible Answers

    loans.duplicated()
      Because the default method returns both complete and incomplete duplicates.  X (the default only flags complete duplicates, identical in every column)
    loans.duplicated(subset = 'first_name')
      Because constraining the duplicate rows to the first name lets me find incomplete duplicates as well.  X
    loans.duplicated(subset = ['first_name', 'last_name'], keep = False)
      Because subsetting on consumer metadata and not discarding any duplicate returns all duplicated rows.  (correct)
    loans.duplicated(subset = ['first_name', 'last_name'], keep = 'first')
      Because this drops all duplicates.  X
    
# The value we set for the "keep=" arg decides which duplicates are marked True in the boolean Series; it does not keep or drop any records
# *******************************************************************************************************************
    keep : {'first', 'last', False}, default 'first'
        Determines which duplicates (if any) to mark.
    
        - ``first`` : Mark duplicates as ``True`` except for the first occurrence.
        - ``last`` : Mark duplicates as ``True`` except for the last occurrence.
        - False : Mark all duplicates as ``True``.
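
To see the three "keep=" settings side by side, here is a short sketch on a made-up `loans` frame in which one first name appears three times (hypothetical data, not the exercise's table). It only marks rows; nothing is dropped.

```python
import pandas as pd

# Hypothetical sample: "Justin" appears at positions 0, 2 and 3
loans = pd.DataFrame({"first_name": ["Justin", "Hadrien", "Justin", "Justin"]})

# Collect the boolean flags produced by each keep= setting
flags = {
    keep: loans.duplicated(subset="first_name", keep=keep).tolist()
    for keep in ("first", "last", False)
}

print(flags["first"])  # [False, False, True, True]
print(flags["last"])   # [True, False, True, False]
print(flags[False])    # [True, False, True, True]
```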
    

In [6]:
help(pd.DataFrame.duplicated)

Help on function duplicated in module pandas.core.frame:

duplicated(self, subset: 'Hashable | Sequence[Hashable] | None' = None, keep: "Literal['first'] | Literal['last'] | Literal[False]" = 'first') -> 'Series'
    Return boolean Series denoting duplicate rows.
    
    Considering certain columns is optional.
    
    Parameters
    ----------
    subset : column label or sequence of labels, optional
        Only consider certain columns for identifying duplicates, by
        default use all of the columns.
    keep : {'first', 'last', False}, default 'first'
        Determines which duplicates (if any) to mark.
    
        - ``first`` : Mark duplicates as ``True`` except for the first occurrence.
        - ``last`` : Mark duplicates as ``True`` except for the last occurrence.
        - False : Mark all duplicates as ``True``.
    
    Returns
    -------
    Series
        Boolean series for each duplicated rows.
    
    See Also
    --------
    Index.duplicated : Equivalent met

## Finding duplicates

# A new update to the data pipeline feeding into "ride_sharing" has added the "ride_id" column, which represents a unique identifier for each ride.

# *******************************************************************************************************************
The update however coincided with radically shorter average ride duration times and irregular user birth dates set in the future. Most importantly, the number of rides taken has increased by 20% overnight, leading you to think there might be both complete and incomplete duplicates in the "ride_sharing" DataFrame.
# *******************************************************************************************************************

In this exercise, you will confirm this suspicion by finding those duplicates. A sample of "ride_sharing" is in your environment, as well as all the packages you've been working with thus far.
Instructions
100 XP

    Find duplicated rows of "ride_id"  in the "ride_sharing" DataFrame while setting "keep=" arg to False.
    Subset "ride_sharing" on duplicates and sort by "ride_id" and assign the results to "duplicated_rides".
    Print the "ride_id", "duration" and "user_birth_year" columns of "duplicated_rides" in that order.


In [None]:
duplicates = ride_sharing.duplicated(subset='ride_id', keep=False)


duplicated_rides = ride_sharing[duplicates].sort_values('ride_id')


print(duplicated_rides[['ride_id', 'duration', 'user_birth_year']])

In [None]:
# Find duplicates
duplicates = ____.____(____, ____)

# Sort your duplicated rides
duplicated_rides = ride_sharing[____].____('____')

# Print relevant columns of duplicated_rides
print(duplicated_rides[['____','____','____']])

## Treating duplicates

In the last exercise, you were able to verify that the new update feeding into "ride_sharing" contains a bug generating both complete and incomplete duplicated rows for some values of the "ride_id" column, with occasional discrepant values for the "user_birth_year" and "duration" columns.

# *******************************************************************************************************************
In this exercise, you will be treating those duplicated rows by first dropping complete duplicates, and then merging the incomplete duplicate rows into one while keeping the average duration, and the minimum user_birth_year for each set of incomplete duplicate rows.
# *******************************************************************************************************************
Instructions
100 XP

    Drop complete duplicates in "ride_sharing" and store the results in "ride_dup".
#    Create the statistics dictionary which holds minimum aggregation for "user_birth_year" and mean aggregation for "duration".
#    Drop incomplete duplicates by grouping by "ride_id" and applying the aggregation in statistics.
    Find duplicates again and run the assert statement to verify de-duplication.


In [None]:
ride_dup = ride_sharing.drop_duplicates()  # No need to set "inplace=" to True, as we assign the result to a variable


statistics = {'user_birth_year': 'min', 'duration': 'mean'}


ride_unique = ride_dup.groupby('ride_id').agg(statistics).reset_index()


duplicates = ride_unique.duplicated(subset='ride_id', keep=False)

duplicated_rides = ride_unique[duplicates]

# Verify that no duplicated ride_id values remain
assert duplicated_rides.shape[0] == 0




In [None]:
# Drop complete duplicates from ride_sharing
ride_dup = ____.____()

# Create statistics dictionary for aggregation function
statistics = {'user_birth_year': ____, 'duration': ____}

# Group by ride_id and compute new statistics
ride_unique = ride_dup.____('____').____(____).reset_index()

# Find duplicated values again
duplicates = ride_unique.____(subset = 'ride_id', keep = False)
duplicated_rides = ride_unique[duplicates == True]

# Assert duplicates are processed
assert duplicated_rides.shape[0] == 0  ##############################################################################

## Membership constraints






**Good work on chapter 1.  We are now equipped to treat more complex and specific data cleaning problems.  In this chapter, we're going to take a look at common data problems with text and categorical data.  So let's get started.  

In this lesson, we'll focus on categorical variables.  As discussed early in chapter 1, categorical data represent variables that take values from a predefined, finite set of categories.  Examples range from marriage status and household income categories to loan status and others.  

# To run machine learning models on categorical data, they are often coded as numbers.  
Since categorical data represent a predefined set of categories, they can't have values that go beyond these predefined categories.  We can have inconsistencies in our categorical data for a variety of reasons.  This could be due to data entry issues with free text vs dropdown fields, data parsing errors and other types of errors.  

There's a variety of ways we can treat these, with increasingly specific solutions for different types of inconsistencies.  Most simply, we can drop the rows with incorrect categories.  We can attempt remapping incorrect categories to correct ones, and more.  We'll see a variety of ways of dealing with this throughout the chapter and the course, but for now we'll just focus on dropping data.  


# *******************************************************************************************************************
Let's first look at an example.  Here is a DataFrame named study_data containing a list of first names, birth dates, and blood types.  Additionally, a DataFrame named categories, containing the correct possible categories for the bloodtype column, has been created as well.  Notice the inconsistency here?  There's definitely no blood type named Z+.  Luckily, the categories DataFrame will help us systematically spot all rows with these inconsistencies.  

# It's always good practice to keep a log of all possible values of your categorical data, as it will make dealing with these types of inconsistencies way easier.  

# *******************************************************************************************************************
Now, before moving on to dealing with these inconsistent values, let's have a brief reminder on joins.  The two main types of joins we care about here are anti-joins and inner-joins.  We join DataFrames on common columns between them. 

# *******************************************************************************************************************
An anti-join takes in 2 DataFrames A and B, and returns the data from one DataFrame that is not contained in the other. 
Imagine we're performing a left anti-join of A and B: it returns the rows of A whose values in the common column being joined on are found only in A and not in B.  

An inner-join returns only the data that is contained in both DataFrames.  For example, an inner-join of A and B would return columns from both DataFrames for values found in both A and B, in the common column being joined on.  
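
The two joins can be sketched with "DataFrame.merge()". This is one common way to express them, not necessarily how the course does it: the inner join uses how="inner", and the left anti-join is emulated with a left join plus indicator=True, keeping only rows marked "left_only". The frames and the `blood_type` column name are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data: "Z+" is not a valid blood type
study_data = pd.DataFrame({
    "name": ["Beth", "Paul", "Jennifer"],
    "blood_type": ["B-", "O+", "Z+"],
})
categories = pd.DataFrame({
    "blood_type": ["O-", "O+", "A-", "A+", "B+", "B-", "AB+", "AB-"],
})

# Inner join: rows of study_data whose blood_type also appears in categories
consistent = study_data.merge(categories, on="blood_type", how="inner")

# Left anti-join: left join with an indicator column, keep rows found only on the left
merged = study_data.merge(categories, on="blood_type", how="left", indicator=True)
inconsistent = merged[merged["_merge"] == "left_only"].drop(columns="_merge")

print(consistent)
print(inconsistent)
```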
# *******************************************************************************************************************


In our example, a left anti-join essentially returns all the data in study_data with inconsistent blood types, and an inner-join returns all the rows containing consistent blood types.  Now let's see how to do that in Python.  We first get all inconsistent categories in the blood_type column of the study_data DataFrame.  We do that by creating a set out of the blood_type column, which stores its unique values, and using the ".difference()" method, which takes in as its argument the blood_type column from the categories DataFrame.  This returns all the categories in blood_type that are not in categories.  We then find the inconsistent rows by finding all the rows of the blood_type column that are equal to one of the inconsistent categories, using the ".isin()" method.  This returns a Series of boolean values that are True for inconsistent rows and False for consistent ones.  We then subset the study_data DF using boolean indexing, and voila, we have our inconsistent data.  

To drop the inconsistent rows and keep only the consistent ones, we just use the tilde symbol (~) while subsetting, which returns everything except the inconsistent rows.  
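
As a quick check of the logic above, the sketch below runs the set-difference route on a made-up frame and compares it with a shortcut: negating ".isin()" against the valid categories directly. The shortcut is an addition for illustration, not something the lesson uses, and all data and the `blood_type` column name are hypothetical.

```python
import pandas as pd

# Hypothetical sample data: "Z+" is not a valid blood type
study_data = pd.DataFrame({
    "name": ["Beth", "Ignatius", "Jennifer"],
    "blood_type": ["B-", "A-", "Z+"],
})
categories = pd.DataFrame({
    "blood_type": ["O-", "O+", "A-", "A+", "B+", "B-", "AB+", "AB-"],
})

# Route 1: explicit set difference, then membership test
inconsistent_categories = set(study_data["blood_type"]).difference(categories["blood_type"])
inconsistent_rows = study_data["blood_type"].isin(inconsistent_categories)

# Route 2 (shortcut): a row is inconsistent if its value is not a valid category
shortcut_rows = ~study_data["blood_type"].isin(categories["blood_type"])

# Tilde flips the mask to keep only the consistent rows
consistent_data = study_data[~inconsistent_rows]

print(inconsistent_categories)
print(consistent_data)
```

The explicit set has the advantage of showing which bad values exist before anything is dropped.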


Now that we know about treating categorical data, let's practice.  



# Read study data and print it
study_data = pd.read_csv('study.csv')
print(study_data)

-------------------------------------------------
    name       | birthday     | bloodtype
1   Beth       | 2019-10-20   | B-
2   Ignatius   | 2020-07-08   | A-
3   Paul       | 2019-08-12   | O+
4   Helen      | 2019-03-17   | O-
5   Jennifer   | 2019-12-17   | Z+   <---
6   Kennedy    | 2020-04-27   | A+
7   Keityh     | 2019-04-19   | AB+


# Correct possible blood types
print(categories)

-----------------------
    bloodtype
1   O-
2   O+
3   A-
4   A+
5   B+
6   B-
7   AB+
8   AB-


# *******************************************************************************************************************

inconsistent_categories = set(study_data['bloodtype']).difference(categories['bloodtype'])  #########################
print(inconsistent_categories)

{'Z+'}

# Get and print rows with inconsistent categories
inconsistent_rows = study_data['bloodtype'].isin(inconsistent_categories)

inconsistent_data = study_data[inconsistent_rows]
print(inconsistent_data)

# Drop inconsistent categories and get consistent data only
consistent_data = study_data[~inconsistent_rows]


## Members only

Throughout the course so far, you've been exposed to some common problems that you may encounter with your data, from data type constraints, data range constraints, uniqueness constraints, and now membership constraints for categorical values.

In this exercise, you will map hypothetical problems to their respective categories.
Instructions
100XP

    Map the data problem observed with the correct type of data problem.

Hint

#    Remember, a membership constraint is when a categorical column has values that are not in the predefined set of categories of your column.

Other Constraint: 
     A "revenue" column represented as a string
     A "birthdate" column with value in the future
     An "age" column with value above "130"


Membership Constraint:
     A "month" column with the value "14"
     A "day_of_week" column with the value "Suntermonday"
     A "GPA" column containing a "z-" grade
     A "has_loan" column with the value "12"


## Finding consistency

# *******************************************************************************************************************
In this exercise and throughout this chapter, you'll be working with the airlines DataFrame which contains survey responses on the San Francisco Airport from airline customers.

The DataFrame contains flight metadata such as the airline, the destination, waiting times as well as answers to key questions regarding cleanliness, safety, and satisfaction. Another DataFrame named categories was created, containing all correct possible values for the survey columns.
# *******************************************************************************************************************

In this exercise, you will use both of these DataFrames to find survey answers with inconsistent values, and drop them, effectively performing an outer and inner join on both these DataFrames as seen in the video exercise. The pandas package has been imported as pd, and the airlines and categories DataFrames are in your environment.
Instructions 1/4
35 XP

    Question 1
#    Print the categories DataFrame and take a close look at all possible correct categories of the survey columns.
#    Print the unique values of the survey columns in airlines using the ".unique()" method.
    
    
    
    Question 2
#    Print the unique values of the survey columns in airlines using the .unique() method.
    
    
    
    Question 3
#    Create a set out of the cleanliness column in airlines using set() and find the inconsistent category by finding the difference in the cleanliness column of categories.
    
    
    
    Question 4
    Find rows of airlines with a cleanliness value not in categories and print the output.
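
The four steps above handle one column at a time. As a sketch of how the same idea might extend to every survey column at once, the snippet below loops over the columns of `categories`; the tiny `airlines` frame and the value "Unacceptable" are made up here, and the loop itself is an illustration rather than part of the exercise.

```python
import pandas as pd

# Hypothetical sample of survey answers; "Unacceptable" is not a valid category
airlines = pd.DataFrame({
    "cleanliness": ["Clean", "Average", "Unacceptable"],
    "safety": ["Very safe", "Neutral", "Very safe"],
})
categories = pd.DataFrame({
    "cleanliness": ["Clean", "Somewhat clean", "Average", "Somewhat dirty", "Dirty"],
    "safety": ["Very safe", "Somewhat safe", "Neutral", "Very unsafe", "Somewhat unsafe"],
})

# Out-of-category values for each survey column
bad_values = {
    col: set(airlines[col]).difference(categories[col])
    for col in categories.columns
}
print(bad_values)

# Flag rows that have a bad value in any survey column
bad_rows = pd.Series(False, index=airlines.index)
for col, bad in bad_values.items():
    bad_rows |= airlines[col].isin(bad)

# Keep only rows that are consistent in every survey column
clean_airlines = airlines[~bad_rows]
print(clean_airlines)
```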


In [None]:
dict_val_lis = {'col_key1': ['val_11', 'val_12', 'val_13', 'val_14'],
                'col_key2': ['val_21', 'val_22', 'val_23', 'val_24'],
                'col_key3': ['val_31', 'val_32', 'val_33', 'val_34']}

# +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

list_dict = [{'country': 'Brazil', 'capital': 'Brasilia', 'area': 8.516, 'population': 200.4}, 
             {'country':'Russia', 'capital':'Moscow', 'area':17.10, 'population':143.5}, 
             {'country':'India', 'capital':'New Delhi', 'area':3.286, 'population':1252}, 
             {'country':'China', 'capital':'Beijing', 'area':9.597, 'population':1357}, 
             {'country':'South Africa','capital':'Pretoria','area':1.221, 'population':52.98}]

import pandas as pd
outcome = pd.DataFrame(list_dict)
print(outcome)

In [40]:
import pandas as pd


airlines = pd.read_csv('airlines_final.csv', index_col=0)
print(airlines.head(), "\n")


# *******************************************************************************************************************
dirty_airlines = pd.DataFrame({'cleanliness': ['cool'], 'safety': ['good quality'], 'satisfaction': ['best ever']})

airlines = pd.concat([airlines, dirty_airlines])#, ignore_index=True)
# *******************************************************************************************************************

print(airlines.tail(), "\n")


print(type(airlines['cleanliness'].unique()))
print(airlines['cleanliness'].unique())
print(airlines['safety'].unique())
print(airlines['satisfaction'].unique(), "\n")


categories = pd.DataFrame({'cleanliness': ['Clean', 'Somewhat clean', 'Average', 'Somewhat dirty', 'Dirty'], 
                          'safety': ['Very safe', 'Somewhat safe', 'Neutral', 'Very unsafe', 'Somewhat unsafe'], 
                          'satisfaction': ['Very satisfied', 'Somewhat satsified', 'Neutral', 'Somewhat unsatisfied', 'Very unsatisfied']})

print(categories, '\n')

#print(airlines[['cleanliness', 'safety', 'satisfaction']].unique())



# *******************************************************************************************************************
# So let's imagine we'll drop every row where a column has an inconsistent value
# *******************************************************************************************************************
bad_cleanliness = set(airlines['cleanliness']).difference(categories['cleanliness'])  ###############################
print(bad_cleanliness)

inconsistens_cleanliness_rows = airlines['cleanliness'].isin(bad_cleanliness)

print(inconsistens_cleanliness_rows[-6:], '\n')

good_cleanliness = airlines[~inconsistens_cleanliness_rows]
print(good_cleanliness.tail())

     id        day      airline        destination    dest_region dest_size  \
0  1351    Tuesday  UNITED INTL             KANSAI           Asia       Hub   
1   373     Friday       ALASKA  SAN JOSE DEL CABO  Canada/Mexico     Small   
2  2820   Thursday        DELTA        LOS ANGELES        West US       Hub   
3  1157    Tuesday    SOUTHWEST        LOS ANGELES        West US       Hub   
4  2992  Wednesday     AMERICAN              MIAMI        East US       Hub   

  boarding_area   dept_time  wait_min     cleanliness         safety  \
0  Gates 91-102  2018-12-31     115.0           Clean        Neutral   
1   Gates 50-59  2018-12-31     135.0           Clean      Very safe   
2   Gates 40-48  2018-12-31      70.0         Average  Somewhat safe   
3   Gates 20-39  2018-12-31     190.0           Clean      Very safe   
4   Gates 50-59  2018-12-31     559.0  Somewhat clean      Very safe   

         satisfaction  
0      Very satisfied  
1      Very satisfied  
2             Neutra

In [None]:
    # Find the cleanliness category in airlines not in categories
    cat_clean = set(airlines['cleanliness']).difference(categories['cleanliness'])

    # Find rows with that category
    cat_clean_rows = airlines['cleanliness'].isin(cat_clean)

    # Print rows with inconsistent category
    print(airlines[cat_clean_rows])

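    -> Turn the cleanliness column of airlines into a set (duplicates removed) and use a set difference
    to store the non-overlapping values in cat_clean. Only 'Unacceptable' is stored, since the overlapping values are excluded.

    -> Then store in cat_clean_rows, as booleans, whether each value of the cleanliness column
    is contained in cat_clean.

    -> Finally, print the rows that contain these values.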

## Categorical variables





**Awesome work on the last lesson.  Now let's discuss other types of problems that could affect categorical variables.  In the last lesson, we saw how categorical data has a value membership constraint, where columns need to have a predefined set of values.  However, this is not the only set of problems we may encounter.  

When cleaning categorical data, some of the problems we may encounter include value inconsistency, the presence of too many categories that could be collapsed into one, and making sure data is of the right type.  Let's start with making sure our categorical data is consistent.  

# *******************************************************************************************************************
# A common categorical data problem is having values that slightly differ because of capitalizations.  
Not treating this could lead to misleading results when we decide to analyze the data.  For example, let's assume we're working with a demographics dataset, and we have a marriage status column with inconsistent capitalization.  Below is what counting the number of married people in the marriage_status Series would look like.  

Note that in older versions of pandas the ".value_counts()" method works on Series only (DataFrame.value_counts() was added in pandas 1.1).  For a DF, we can also "df.groupby()" the column and use the ".count()" method.  

To deal with this, we can either capitalize or lowercase the marriage_status column.  This can be done with the "str.upper()" or "str.lower()" function respectively.  
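
A minimal sketch of the capitalization fix, on a made-up `demographics` frame (the five values are hypothetical). Lowercasing before counting merges the categories that differ only by case.

```python
import pandas as pd

# Hypothetical sample with inconsistent capitalization
demographics = pd.DataFrame({
    "marriage_status": ["married", "MARRIED", "unmarried", "UNMARRIED", "married"],
})

# Before: four apparent categories
raw_counts = demographics["marriage_status"].value_counts()
print(raw_counts)

# After lowercasing: two categories
demographics["marriage_status"] = demographics["marriage_status"].str.lower()
counts = demographics["marriage_status"].value_counts()
print(counts)
```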


# *******************************************************************************************************************
# Another common problem with categorical values are leading or trailing spaces.  
For example, imagine the same demographics DF containing values with leading or trailing spaces.  Below is what the counts of married vs unmarried people would look like.  Note that there is a married category with a trailing space on the right, which makes it hard to spot in the output, as opposed to unmarried, whose leading space is visible.  

To remove leading and trailing spaces, we can use the "str.strip()" method, which, when given no input, strips all leading and trailing white space.  
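
The whitespace fix can be sketched the same way, again on made-up values. Counting unique values before and after ".str.strip()" makes the otherwise invisible trailing space show up.

```python
import pandas as pd

# Hypothetical sample: one value has a leading space, one a trailing space
demographics = pd.DataFrame({
    "marriage_status": [" unmarried", "married ", "married", "unmarried"],
})

# Before: the padded strings count as separate categories
n_before = demographics["marriage_status"].nunique()

# Strip leading and trailing white space from every value
demographics["marriage_status"] = demographics["marriage_status"].str.strip()
n_after = demographics["marriage_status"].nunique()

print(n_before, n_after)  # 4 2
```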



# *******************************************************************************************************************
# Sometimes, we may want to create categories out of our data, such as creating household income groups from income data.  
To create categories out of data, let's use the example of creating an income group column in the demographics DataFrame.  (do you remember how we did this using "pd.cut()" & "pd.qcut()"?)  

# "pd.qcut(df['column'], q=, labels=)"
We can do this in 2 ways.  The first method uses the "pd.qcut()" function from pandas, which automatically divides our data, based on its distribution, into the number of categories we set in the "q=" argument.  We created the category names in the group_names list and fed it to the "labels=" argument, returning the following.  

Notice the first row actually misrepresents the actual income of the income group, as we didn't instruct qcut where our ranges actually lie.  

# "pd.cut(df['column'], bins=, labels=)"
We can do this with the "pd.cut()" function instead, which lets us define category cutoff ranges with the "bins=" argument.  It takes in a list of cutoff points for each category, with the final one being infinity, represented by the constant "np.inf" from NumPy (an attribute, not a function call).  From the output, we can see that this is much more correct.  
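
To compare the two functions directly, here is a sketch that applies both to the same made-up income column (six hypothetical values, two of them borrowed from the slide output). With "pd.qcut()" the labels are attached to equally sized groups, so a label such as "200-500k" can land on an income below 200k; "pd.cut()" uses the explicit cutoffs instead.

```python
import numpy as np
import pandas as pd

# Hypothetical household incomes
demographics = pd.DataFrame({
    "household_income": [50_000, 120_000, 189_243, 300_000, 650_000, 778_533],
})
group_names = ["0-200k", "200-500k", "500k+"]

# qcut: three equally sized groups by quantile, regardless of what the labels claim
demographics["income_qcut"] = pd.qcut(
    demographics["household_income"], q=3, labels=group_names
)

# cut: explicit cutoff points, with np.inf as the open upper end
ranges = [0, 200_000, 500_000, np.inf]
demographics["income_cut"] = pd.cut(
    demographics["household_income"], bins=ranges, labels=group_names
)

print(demographics)
```

In this sample, the 189,243 row is labelled "200-500k" by qcut but "0-200k" by cut.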



# *******************************************************************************************************************
# Sometimes, we may want to reduce the number of categories we have in our data.  Let's move on to mapping categories to fewer ones.  

For example, assume we have a column containing the operating system of different devices, which contains these unique values: Microsoft, MacOS, IOS, Android, Linux.  Say we want to collapse these categories into 2: DesktopOS and MobileOS.  

# We can do this by using the ".replace()" method.  
It takes in a dictionary that maps each existing category to the category name you desire; here, that is the mapping dictionary.  A quick print of the unique values of operating_system shows the mapping has been completed.  
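
A short sketch of the remapping on a made-up `devices` frame. The extra "ChromeOS" value is hypothetical and deliberately left out of the mapping, to show that ".replace()" leaves unmapped values unchanged, which is worth checking for after collapsing categories.

```python
import pandas as pd

# Hypothetical sample; "ChromeOS" is intentionally absent from the mapping below
devices = pd.DataFrame({
    "operating_system": ["Microsoft", "MacOS", "IOS", "Android", "Linux", "ChromeOS"],
})

mapping = {
    "Microsoft": "DesktopOS", "MacOS": "DesktopOS", "Linux": "DesktopOS",
    "IOS": "MobileOS", "Android": "MobileOS",
}

# Values found in the mapping are replaced; anything else passes through untouched
devices["operating_system"] = devices["operating_system"].replace(mapping)

print(devices["operating_system"].unique())
```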


Now that we know about treating categorical data, let's practice.  



 (1) Value inconsistency
     Inconsistent fields: 'married', 'Maried', 'UNMARRIED', 'not married' ...
     Trailing white spaces: 'married', ' married '
 (2) Collapsing too many categories to few
     Creating new groups: "0-20k", "20-40k" categories from continuous household income data
     Mapping groups to new ones: Mapping household income categories to "rich", "poor"


# Get marriage status column
marriage_status = demographics['marriage_status']

marriage_status.value_counts()

unmarried   352
married     268
MARRIED     204
UNMARRIED   176
dtype: int64


 unmarried   352
married      268
MARRIED      204
UNMARRIED    176
dtype: int64



# Using  "pd.qcut()"
group_names = ['0-200k', '200-500k', '500k+']

demographics["income_group"] = pd.qcut(demographics['household_income'], q=3, labels=group_names)

# Print income_group column
demographics[['income_group', 'household_income']]

-----------------------------------------
    income_group  | household_income
    200-500k      | 189243
    500k+         | 778533
    


# Using "pd.cut()" - create category ranges and names
ranges = [0, 200000, 500000, np.inf]  ###############################################################################

group_names = ['0-200k', '200-500k', '500k+']

# Create income group column
demographics['income_group'] = pd.cut(demographics['household_income'], bins=ranges, labels=group_names)

-----------------------------------------
    income_group  | household_income
    0-200k        | 189243
    500k+         | 778533
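To see why the two outputs above label 189243 differently, here is a small side-by-side sketch. The income values are assumptions picked for illustration (only 189243 and 778533 appear in the tables above); the point is that `pd.qcut()` splits by quantiles while `pd.cut()` uses the bin edges we pass in.

```python
import numpy as np
import pandas as pd

# Hypothetical incomes, chosen only to make the difference visible
income = pd.Series([50_000, 120_000, 189_243, 310_000, 450_000, 778_533])
group_names = ["0-200k", "200-500k", "500k+"]

# pd.qcut: three equally sized groups; the labels are just names for quantiles
by_quantile = pd.qcut(income, q=3, labels=group_names)

# pd.cut: explicit bin edges; each label matches the actual income range
by_range = pd.cut(income, bins=[0, 200_000, 500_000, np.inf], labels=group_names)

print(pd.DataFrame({"income": income, "qcut": by_quantile, "cut": by_range}))
```

With `pd.qcut()` the label names do not have to match the real ranges, which is why `pd.cut()` is the safer choice when the group names describe concrete boundaries.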
    
    
    
# Create a mapping dictionary and replace
mapping = {"Microsoft": "DesktopOS", "MacOS": "DesktopOS", "Linux": "DesktopOS", "IOS": "MobileOS", "Android": "MobileOS"}

devices["operating_system"] = devices["operating_system"].replace(mapping) ##########################################
# *******************************************************************************************************************

devices["operating_system"].unique()

## Categories of errors

In the video exercise, you saw how to address common problems affecting categorical variables in your data, including white spaces and inconsistencies in your categories, and the problem of creating new categories and mapping existing ones to new ones.

To get a better idea of the toolkit at your disposal, you will be mapping functions and methods from pandas and Python used to address each type of problem.
Instructions
100XP

    Map each function/method to the categorical data problem it solves.

White spaces and inconsistency:
     .str.upper()
     .str.lower()
     .str.strip()

Creating or remapping categories:
     pd.cut()
     .replace()
     pd.qcut()


##  Inconsistent categories

In this exercise, you'll be revisiting the airlines DataFrame from the previous lesson.

As a reminder, the DataFrame contains flight metadata such as the airline, the destination, waiting times as well as answers to key questions regarding cleanliness, safety, and satisfaction at the San Francisco Airport.

In this exercise, you will examine two categorical columns from this DataFrame, dest_region and dest_size respectively, assess how to address them and make sure that they are cleaned and ready for analysis. The pandas package has been imported as pd, and the airlines DataFrame is in your environment.
Instructions 1/4
25 XP

    Question 1
    Print the unique values in "dest_region" and "dest_size" respectively.
    
    
    
    Question 2
    From looking at the output, what do you think is the problem with these columns?
    Possible Answers
    The "dest_region" column has only inconsistent values due to capitalization.
#    The "dest_region" column has inconsistent values due to capitalization and has one value that needs to be remapped.    Yes: 'eur' should be mapped to 'europe'.
    The dest_size column has only inconsistent values due to leading and trailing spaces.
    
    
    
    Question 3
#    Change the capitalization of all values of "dest_region" to lowercase.
#    Replace the 'eur' with 'europe' in dest_region using the ".replace()" method.
    
    
    
    Question 4
#    Strip white spaces from the "dest_size" column using the ".strip()" method.
#    Verify that the changes have taken effect by printing the unique values of the columns using .unique() .


In [15]:
import pandas as pd


airlines = pd.read_csv('airlines_final.csv')
print(airlines.head(), '\n')


print(airlines['dest_region'].unique(), '\n')

print(airlines['dest_size'].unique(), '\n')



airlines['dest_region'] = airlines['dest_region'].str.lower()


airlines['dest_region'] = airlines['dest_region'].replace({'eur': 'europe'})  #######################################


print(airlines['dest_region'].unique(), '\n')


airlines['dest_size'] = airlines['dest_size'].str.strip()  ##########################################################

airlines['dest_size'] = airlines['dest_size'].astype('category')  ###################################################

print(airlines['dest_size'].unique(), '\n')
print(airlines['dest_size'].dtype)

   Unnamed: 0    id        day      airline        destination    dest_region  \
0           0  1351    Tuesday  UNITED INTL             KANSAI           Asia   
1           1   373     Friday       ALASKA  SAN JOSE DEL CABO  Canada/Mexico   
2           2  2820   Thursday        DELTA        LOS ANGELES        West US   
3           3  1157    Tuesday    SOUTHWEST        LOS ANGELES        West US   
4           4  2992  Wednesday     AMERICAN              MIAMI        East US   

  dest_size boarding_area   dept_time  wait_min     cleanliness  \
0       Hub  Gates 91-102  2018-12-31     115.0           Clean   
1     Small   Gates 50-59  2018-12-31     135.0           Clean   
2       Hub   Gates 40-48  2018-12-31      70.0         Average   
3       Hub   Gates 20-39  2018-12-31     190.0           Clean   
4       Hub   Gates 50-59  2018-12-31     559.0  Somewhat clean   

          safety        satisfaction  
0        Neutral      Very satisfied  
1      Very safe      Very satis

## Remapping categories

To better understand survey respondents from airlines, you want to find out if there is a relationship between certain responses and the day of the week and wait time at the gate.

The airlines DataFrame contains the "day" and "wait_min" columns, which are categorical and numerical respectively. The day column contains the exact day a flight took place, and wait_min contains the number of minutes travelers waited at the gate. To make your analysis easier, you want to create two new categorical variables:

    wait_type: 'short' for 0-60 min, 'medium' for 60-180 min and 'long' for 180+ min
    day_week: 'weekday' if day is in the weekday, 'weekend' if day is in the weekend.
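One detail worth checking when binning wait times is how `pd.cut()` treats the bin edges. The sketch below uses made-up wait times (an assumption, not the real `airlines` data) to show that by default the intervals are open on the left, so a wait of exactly 0 minutes gets no label unless `include_lowest=True` is passed.

```python
import numpy as np
import pandas as pd

# Hypothetical wait times, including the edge values 0, 60 and 180
wait_min = pd.Series([0, 45, 60, 61, 180, 559])
bins = [0, 60, 180, np.inf]
labels = ["short", "medium", "long"]

# Default: intervals are (0, 60], (60, 180], (180, inf], so 0 falls outside
default_cut = pd.cut(wait_min, bins=bins, labels=labels)

# include_lowest=True makes the first interval [0, 60]
inclusive_cut = pd.cut(wait_min, bins=bins, labels=labels, include_lowest=True)

print(pd.DataFrame({"wait_min": wait_min, "default": default_cut, "inclusive": inclusive_cut}))
```

Whether a zero wait should count as 'short' or as missing depends on the data, so this is a choice to make deliberately.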

The pandas and numpy packages have been imported as pd and np. Let's create some new categorical data!
Instructions
100 XP

    Create the ranges and labels for the "wait_type" column mentioned in the description.
#    Create the "wait_type" column from wait_min by using "pd.cut()", while inputting "label_ranges" and "label_names" in the correct arguments.
#    Create the mapping dictionary mapping weekdays to 'weekday' and weekend days to 'weekend'.
#    Create the day_week column by using .replace().


In [29]:
print(airlines.head(), '\n')


print(airlines['day'].dtypes, '\n')

print(airlines['day'].unique(), '\n')

airlines['day'] = airlines['day'].astype('category')
print(airlines['day'].dtypes, '\n')



import numpy as np

wait_ranges = [0, 60, 180, np.inf]
wait_labels = ['short', 'medium', 'long']

airlines['wait_type'] = pd.cut(airlines['wait_min'], bins=wait_ranges, labels=wait_labels)


day_mapping = {'Tuesday':'weekday', 'Friday':'weekday', 'Thursday':'weekday',  'Wednesday':'weekday',
               'Saturday':'weekend', 'Sunday':'weekend', 'Monday':'weekday'}

airlines['day_week'] = airlines['day'].replace(day_mapping)  ########################################################


print(airlines['day_week'].unique(), '\n')
print(airlines['wait_type'].unique(), '\n')

print(airlines.sample(7))

   Unnamed: 0    id        day      airline        destination    dest_region  \
0           0  1351    Tuesday  UNITED INTL             KANSAI           asia   
1           1   373     Friday       ALASKA  SAN JOSE DEL CABO  canada/mexico   
2           2  2820   Thursday        DELTA        LOS ANGELES        west us   
3           3  1157    Tuesday    SOUTHWEST        LOS ANGELES        west us   
4           4  2992  Wednesday     AMERICAN              MIAMI        east us   

  dest_size boarding_area   dept_time  wait_min     cleanliness  \
0       Hub  Gates 91-102  2018-12-31     115.0           Clean   
1     Small   Gates 50-59  2018-12-31     135.0           Clean   
2       Hub   Gates 40-48  2018-12-31      70.0         Average   
3       Hub   Gates 20-39  2018-12-31     190.0           Clean   
4       Hub   Gates 50-59  2018-12-31     559.0  Somewhat clean   

          safety        satisfaction wait_type day_week  
0        Neutral      Very satisfied    medium  week

In [None]:
# Create ranges for categories
label_ranges = [0, 60, ____, np.inf]
label_names = ['short', ____, ____]

# Create wait_type column
airlines['wait_type'] = pd.____(____, bins = ____, 
                                labels = ____)

# Create mappings and replace
mappings = {'Monday':'weekday', 'Tuesday':'____', 'Wednesday': '____', 
            'Thursday': '____', '____': '____', 
            'Saturday': 'weekend', '____': '____'}

airlines['day_week'] = airlines['day'].____(mappings)

## Cleaning text data





**Good job on the previous lesson.  In the final lesson of this chapter, we'll talk about text data and regular expressions.  Text data is one of the most common types of data.  Examples range from names, phone numbers, and addresses to emails and more.  

Common text data problems include handling inconsistencies, making sure text data is of a certain length, typos, and others.  Let's take a look at the following example.  Below is a DataFrame named phones containing the full names and phone numbers of individuals.  Both are string columns.  

Notice the phone number column.  We can see that some phone numbers begin with 00 and others with +.  We also see that there is one entry where the phone number is only 4 digits long, which cannot be a real number.  Furthermore, we can see that there are dashes across the phone number column.  

If we wanted to feed these phone numbers into an automated call system, or create a report discussing the distribution of users by area code, we couldn't really do so without uniform phone numbers.  

Ideally, we'd want the phone number column to look as follows: all phone numbers begin with 00, any number shorter than 10 digits is replaced with NaN to represent a missing value, and all dashes have been removed.  Let's see how that's done.  

# ".str.replace()" on string data and ".str.len()" to subset & index damaged data to fill as NaN
# *******************************************************************************************************************
Let's begin by replacing the plus sign with 00.  To do this, we use the ".str.replace()" method, which takes in 2 values: the string being replaced, which in this case is the plus sign, and the string to replace it with, which in this case is 00.  We can see that the column has been updated.  We use the exact same technique to remove dashes, by replacing the dash symbol with an empty string.  Finally, we're going to replace all phone numbers shorter than 10 digits with NaN.  We can do this by chaining the Phone number column with ".str.len()", which returns the string length of each row in the column.  We can then use ".loc[]" to index rows where the number of digits is below 10, and replace the value of Phone number with NumPy's nan object.  

We can also write assert statements to test whether the Phone number column has a specific length, and whether it contains the symbols we removed.  The first assert statement tests that the minimum length of the strings in the Phone number column, found through ".str.len()", is greater than or equal to 10.  In the second assert statement, we use the ".str.contains()" method to test whether the Phone number column contains a specific pattern.  It returns a series of booleans that are True for matches and False for non-matches.  We set the pattern "\+|-": the pipe (|) here is basically an or statement, so we're trying to find matches for either symbol, and the plus sign is escaped with a backslash because an unescaped + is a regex quantifier.  We chain it with the ".any()" method, which returns True if any element in the output of ".str.contains()" is True, and test whether it returns False.  
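The cleaning steps and both assert statements can be put together in one runnable sketch. It uses three rows copied from the phones table in this lesson; passing `regex=False` and `na=False` is my own choice here (not something the lesson prescribes), so that the symbols are treated literally and missing values count as "no match".

```python
import numpy as np
import pandas as pd

# Sample rows copied from the phones table in this lesson
phones = pd.DataFrame({
    "Full name": ["Noelani A. Gray", "Prescott D. Hardin", "Reece M. Andrews"],
    "Phone number": ["001-702-397-5143", "+1-297-996-4904", "4138"],
})

# regex=False treats "+" and "-" as literal characters
phones["Phone number"] = phones["Phone number"].str.replace("+", "00", regex=False)
phones["Phone number"] = phones["Phone number"].str.replace("-", "", regex=False)

# Numbers shorter than 10 digits become missing values
too_short = phones["Phone number"].str.len() < 10
phones.loc[too_short, "Phone number"] = np.nan

# min() skips NaN, so only the remaining real numbers are checked
assert phones["Phone number"].str.len().min() >= 10
# "+" is a regex quantifier, so it is escaped; na=False counts NaN as no match
assert not phones["Phone number"].str.contains(r"\+|-", na=False).any()

print(phones)
```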

# Regular Expressions give us the ability to search for any pattern in text data
But what about more complicated examples?  How can we clean a Phone number column that looks like the one below, for example, where phone numbers can contain a range of symbols from plus signs and dashes to parentheses and maybe more?  This is where Regular Expressions come in.  Regular Expressions give us the ability to search for any pattern in text data, like only digits for example.  They are like Ctrl+F in your browser, but way more dynamic and robust.  

# *******************************************************************************************************************
Let's take a look at this example.  Here we are attempting to extract only the digits from the Phone number column.  To do this, we use the ".str.replace()" method with the pattern we want to replace with an empty string.  Notice the pattern fed into the method: "\D+" matches any run of characters that are not digits.  This is essentially us telling Pandas to replace anything that is not a digit with nothing.  
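As a quick check of the digit-only pattern, here is a small sketch using phone numbers copied from the second phones table in this lesson. It assumes a pandas version where `regex=True` has to be passed explicitly for `.str.replace()` to treat the pattern as a regular expression; the last line does the same substitution with Python's built-in `re` module.

```python
import re

import pandas as pd

# Phone numbers copied from the second phones table in this lesson
raw = pd.Series(["+(01706)-25891", "+0500-571437", "+0800-1111"])

# \D matches any character that is not a digit; + means "one or more"
digits_only = raw.str.replace(r"\D+", "", regex=True)
print(digits_only.tolist())

# The same substitution on a single string with the standard library
print(re.sub(r"\D+", "", "+(016799)-8424"))
```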

We won't get into the specifics of Regular Expressions, and how to construct them, but they are immensely useful for difficult string cleaning tasks, so make sure to check out DataCamp's course library on Regular Expressions.  

Now that we know how to clean text data, let's get to practice.  



 (1) Data inconsistency
     "+96171679912" or "0096171679912"
 (2) Fixed length violations:
     Passwords need to be at least 8 characters
 (3) Typos:
     "+961.71.679912"


phones = pd.read_csv('phones.csv')
print(phones)

--------------------------------------------
            Full name  |       Phone number
      Noelani A. Gray  |   001-702-397-5143
        Myles Z. Gomez |   001-329-485-0540
          Gil B. Silva |   001-195-492-2338
    Prescott D. Hardin |    +1-297-996-4904
    Benedict G. Valdaz |   001-969-820-3536
      Reece M. Andrews |               4138
        Harfa E. Keith |   001-536-175-8444

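import numpy as np

# regex=False treats "+" and "-" as literal characters rather than regex syntax
phones['Phone number'] = phones['Phone number'].str.replace('+', '00', regex=False)  #############################################
phones['Phone number'] = phones['Phone number'].str.replace('-', '', regex=False)

# Wrong attempt, kept as a reminder: len(phones['Phone number']) is the number of rows,
# not the length of each string, and the chained "=" would overwrite the whole column with 'Nan'.
# phones['Phone number'] = phones.loc[len(phones['Phone number'])<10, 'Phone number'] = 'Nan'


# *******************************************************************************************************************
# Correct: ".str.len()" returns the length of each string in the column
phones.loc[phones['Phone number'].str.len()<10, 'Phone number'] = np.nan



# Assert every remaining number has at least 10 digits
assert phones['Phone number'].str.len().min() >= 10


# Assert all numbers do not have "+" or "-" symbol ("+" must be escaped in a regex; na=False treats NaN as no match)
assert phones['Phone number'].str.contains(r'\+|-', na=False).any() == False  ###################################################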

--------------------------------------------
            Full name  |       Phone number
      Noelani A. Gray  |      0017023975143
        Myles Z. Gomez |      0013294850540
          Gil B. Silva |      0011954922338
    Prescott D. Hardin |      0012979964904
    Benedict G. Valdaz |      0019698203536
      Reece M. Andrews |                NaN
        Harfa E. Keith |      0015361758444
        

--------------------------------------------
            Full name  |       Phone number
      Noelani A. Gray  |     +(01706)-25891
        Myles Z. Gomez |       +0500-571437
          Gil B. Silva |         +0800-1111
    Prescott D. Hardin |      +07058-879063
    Benedict G. Valdaz |     +(016799)-8424


# Replace anything that is not a digit with nothing
phones['Phone number'] = phones['Phone number'].str.replace(r'\D+', '', regex=True)  ############################################

--------------------------------------------
            Full name  |       Phone number
      Noelani A. Gray  |         0170625891
        Myles Z. Gomez |         0500571437
          Gil B. Silva |           08001111
    Prescott D. Hardin |        07058879063
    Benedict G. Valdaz |         0167998424

In [65]:
import pandas as pd
import numpy as np


phones = pd.read_csv('phones.csv')
print(phones.head(), '\n')

print(phones['Phone number'].dtypes, '\n')


phones['Phone number'] = phones['Phone number'].str.replace('-', '', regex=False)
phones['Phone number'] = phones['Phone number'].str.replace('+', '', regex=False)
# Older pandas warned here: "FutureWarning: The default value of regex will change from True to False in a future version.
# In addition, single character regular expressions will *not* be treated as literal strings when regex=True."
# With regex=True, current pandas treats a bare '+' as an invalid regular expression, so regex=False is used for these literal replacements.


phones['Phone'] = phones['Phone number'].str.contains('+', regex=False)#.any()  #----------------------------------
print(phones, '\n')
# Return boolean Series or Index based on whether a given pattern or regex is
# contained within a string of a Series or Index.

print(phones['Phone number'].dtypes)


phones.loc[phones['Phone number'].str.len()<10, 'Phone number'] = np.nan  #------------------------------------------
print(phones, '\n')


            Full name      Phone number
0     Noelani A. Gray  001-702-397-5143
1      Myles Z. Gomez  001-329-485-0540
2        Gil B. Silva  001-195-492-2338
3  Prescott D. Hardin   +1-297-996-4904
4  Benedict G. Valdaz  001-969-820-3536 

object 

            Full name   Phone number  Phone
0     Noelani A. Gray  0017023975143  False
1      Myles Z. Gomez  0013294850540  False
2        Gil B. Silva  0011954922338  False
3  Prescott D. Hardin    12979964904  False
4  Benedict G. Valdaz  0019698203536  False
5    Reece M. Andrews           4138  False
6      Harfa E. Keith  0015361758444  False 

object
            Full name   Phone number  Phone
0     Noelani A. Gray  0017023975143  False
1      Myles Z. Gomez  0013294850540  False
2        Gil B. Silva  0011954922338  False
3  Prescott D. Hardin    12979964904  False
4  Benedict G. Valdaz  0019698203536  False
5    Reece M. Andrews            NaN  False
6      Harfa E. Keith  0015361758444  False 



In [44]:
print(dir(phones['Phone number'].str))

help(phones['Phone number'].str.contains)

['__annotations__', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__frozen', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_data', '_doc_args', '_freeze', '_get_series_list', '_index', '_inferred_dtype', '_is_categorical', '_is_string', '_name', '_orig', '_parent', '_validate', '_wrap_result', 'capitalize', 'casefold', 'cat', 'center', 'contains', 'count', 'decode', 'encode', 'endswith', 'extract', 'extractall', 'find', 'findall', 'fullmatch', 'get', 'get_dummies', 'index', 'isalnum', 'isalpha', 'isdecimal', 'isdigit', 'islower', 'isnumeric', 'isspace', 'istitle', 'isupper', 'join', 'len', 'ljust', 'lower', 'lstrip', 'match', 'normalize', 'pad', 'partition', 'removeprefix', 'removesuffix', 'repeat', 'replace',

## Removing titles and taking names

# *******************************************************************************************************************
While collecting survey respondent metadata in the airlines DataFrame, the full name of respondents was saved in the "full_name" column. However upon closer inspection, you found that a lot of the different names are prefixed by honorifics such as "Dr.", "Mr.", "Ms." and "Miss".

Your ultimate objective is to create two new columns named first_name and last_name, containing the first and last names of respondents respectively. Before doing so however, you need to remove honorifics.
# *******************************************************************************************************************

The airlines DataFrame is in your environment, alongside pandas as pd.
Instructions
100 XP

    Remove "Dr.", "Mr.", "Miss" and "Ms." from full_name by replacing them with an empty string "" in that order.
    Run the assert statement using .str.contains() that tests whether full_name still contains any of the honorifics.

Hint

    The .str.replace() method takes in the pattern to find, and the pattern to replace it by.
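A possible way to work through this exercise is sketched below. The names are made up (the real `full_name` column is not shown in these notes), and using `regex=False` plus an escaped pattern in the assert is my own assumption about the safest approach: in a regular expression an unescaped "." matches any character, so "Dr." would also match the "Dre" in a name like "Drew".

```python
import pandas as pd

# Hypothetical names; the real full_name column is not shown in these notes
full_name = pd.Series(["Dr. Ana Lopez", "Mr. Drew Hill", "Miss Mia Chen", "Ms. Lee Park"])

# Replace each honorific literally, in the order given by the instructions
for title in ["Dr.", "Mr.", "Miss", "Ms."]:
    full_name = full_name.str.replace(title, "", regex=False)

# Removing a title leaves a leading space behind
full_name = full_name.str.strip()

# Escaped dots match only a literal ".", so "Drew" is not flagged
assert not full_name.str.contains(r"Dr\.|Mr\.|Miss|Ms\.").any()
print(full_name.tolist())
```

The `.str.strip()` step is not part of the exercise instructions, but without it the later split into first and last names would start with an empty string.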


In [72]:
airlines = pd.read_csv('airlines_final.csv', index_col=0)
print(airlines.head(), '\n')







     id        day      airline        destination    dest_region dest_size  \
0  1351    Tuesday  UNITED INTL             KANSAI           Asia       Hub   
1   373     Friday       ALASKA  SAN JOSE DEL CABO  Canada/Mexico     Small   
2  2820   Thursday        DELTA        LOS ANGELES        West US       Hub   
3  1157    Tuesday    SOUTHWEST        LOS ANGELES        West US       Hub   
4  2992  Wednesday     AMERICAN              MIAMI        East US       Hub   

  boarding_area   dept_time  wait_min     cleanliness         safety  \
0  Gates 91-102  2018-12-31     115.0           Clean        Neutral   
1   Gates 50-59  2018-12-31     135.0           Clean      Very safe   
2   Gates 40-48  2018-12-31      70.0         Average  Somewhat safe   
3   Gates 20-39  2018-12-31     190.0           Clean      Very safe   
4   Gates 50-59  2018-12-31     559.0  Somewhat clean      Very safe   

         satisfaction  
0      Very satisfied  
1      Very satisfied  
2             Neutra

In [None]:
# Replace "Dr." with empty string ""
airlines['full_name'] = airlines['full_name'].str.replace("Dr.","", regex=False)  #-----------------------------------------------

# Replace "Mr." with empty string ""
airlines['full_name'] = ____

# Replace "Miss" with empty string ""
____

# Replace "Ms." with empty string ""
____

# Assert that full_name has no honorifics
assert airlines['full_name'].str.contains(r'Ms\.|Mr\.|Miss|Dr\.').any() == False  #++++++++++++++++++++++++++++++++++++++

## Keeping it descriptive

# *******************************************************************************************************************
To further understand travelers' experiences in the San Francisco Airport, the quality assurance department sent out a qualitative questionnaire to all travelers who gave the airport the worst score on all possible categories. The objective behind this questionnaire is to identify common patterns in what travelers are saying about the airport.

Their response is stored in the "survey_response" column. Upon a closer look, you realized a few of the answers gave the shortest possible character amount without much substance. In this exercise, you will isolate the responses with a character count higher than 40, and make sure your new DataFrame contains only responses with more than 40 characters using an assert statement.
# *******************************************************************************************************************

The airlines DataFrame is in your environment, and pandas is imported as pd.
Instructions
100 XP

#    Using the "airlines" DataFrame, store the length of each instance in the survey_response column in resp_length by using .str.len().
#    Isolate the rows of airlines with resp_length higher than 40.
    Assert that the smallest survey_response length in airlines_survey is now bigger than 40.


In [None]:
airlines.loc[airlines['survey_response'].str.len() < 40, 'survey_response'] = np.nan



In [None]:
# Store length of each row in survey_response column
resp_length = ____

# Find rows in airlines where resp_length > 40
airlines_survey = airlines[____ > ____]

# Assert minimum survey_response length is > 40
assert ____.str.len().____ > _____   ################################################################################

# Print new survey_response column
print(airlines_survey['survey_response'])

In [None]:
# Store length of each row in survey_response column
resp_length = airlines['survey_response'].str.len()

# Find rows in airlines where resp_length > 40
airlines_survey = airlines[resp_length > 40]

# Assert minimum survey_response length is > 40
assert airlines_survey['survey_response'].str.len().min() > 40

# Print new survey_response column
print(airlines_survey['survey_response'])

In [76]:
a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]


def find_z(n):
    
    z = []
    for item in n:
        if item == 1:
            z.append(item)
        if item == 2:
            z.append(item)
            
        for i in range(2, item):
            for j in range (2, i):
                if i % j == 0:
                    break
                else:
                    z.append(i)
            
    return set(z)

find_z(a)

{1, 2, 3, 5, 7, 9}

## Uniformity






**Stellar work on chapter 2.  
# You're now an expert at handling categorical and text variables.  

# *******************************************************************************************************************
In this chapter, we are looking at more advanced data cleaning problems, such as uniformity, cross field validation, and dealing with missing data.  

In chapter 1, we saw how out of range values are a common problem when cleaning data, and that when left untouched, they can skew your analysis.  In this lesson, we're going to tackle a problem that could similarly skew our data, which is unit uniformity.  For example, we can have temperature data that has values in both Fahrenheit and Celsius, weight data in kilograms and in stones, dates in multiple formats, and so on.  Verifying uniformity is imperative to having accurate analysis.  
# *******************************************************************************************************************


Here is a dataset with average temperature data throughout the month of March in New York City.  The dataset was collected from different sources with temperature data in Celsius and Fahrenheit merged together.  We can see that unless a major climate event occurred, the value "62.6" is most likely Fahrenheit not Celsius.  

Let's confirm the presence of these values visually.  We can do so by plotting a scatter plot of our data.  We can do this using matplotlib.pyplot imported as plt.  Use the "plt.scatter()" function, which takes in what to plot on the x axis, the y axis, and which data source to use.  We set the title and axis labels with the helper functions below, then show the plot.  Notice the outer data points?  They all must be Fahrenheit.  A simple Google search returns the formula for converting Fahrenheit to Celsius.  


# *******************************************************************************************************************
To convert our temperature data, we isolate all rows of the Temperature column where it is above 40 using the ".loc[]" method.  We choose 40 because it's a common sense maximum for Celsius temperatures in New York City.  We then convert these values to Celsius using the formula C = (F - 32) * 5/9 and reassign them to their respective Fahrenheit values in temperatures.  We can make sure that our conversion was correct with an assert statement, by making sure the maximum value of Temperature is less than 40.  
# *******************************************************************************************************************
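The conversion described above can be tried out on the five rows from the temperatures table in this lesson. This is a minimal sketch; the boolean mask name `is_fahrenheit` is my own, and the threshold of 40 is the one from the text.

```python
import pandas as pd

# Rows copied from the temperatures table in this lesson
temperatures = pd.DataFrame({
    "Date": ["03.03.19", "04.03.19", "05.03.19", "06.03.19", "07.03.19"],
    "Temperature": [14.0, 15.0, 18.0, 16.0, 62.6],
})

# Anything above 40 is assumed to be Fahrenheit (the threshold from the text)
is_fahrenheit = temperatures["Temperature"] > 40

# C = (F - 32) * 5/9
temperatures.loc[is_fahrenheit, "Temperature"] = (
    temperatures.loc[is_fahrenheit, "Temperature"] - 32
) * 5 / 9

# The conversion worked if no value above 40 is left
assert temperatures["Temperature"].max() < 40
print(temperatures)
```

Here 62.6 Fahrenheit becomes roughly 17 Celsius, which fits in with the other March values.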


Here is another common uniformity problem, this time with date data.  This is a DataFrame called birthdays containing birth dates for a variety of individuals.  It has been collected from a variety of sources and merged into one.  Notice that the observations in the Birthday column come in different formats, and that the column even contains an erroneous value.  We'll learn how to deal with those.  

#  "pd.to_datetime()" function accepts different formats, but
We already discussed datetime objects.  Without getting too much into detail, datetime accepts different formats that help you format your dates as you please.  The Pandas "pd.to_datetime()" function automatically accepts most date formats, but could raise errors when certain formats are unrecognizable.  You don't have to memorize these formats, just know that they exist and are easily searchable.  

# this isn't enough and will most likely raise an error when real-world data mixes multiple formats
You can treat these date inconsistencies easily by converting your date column to datetime.  We can do this in Pandas using the "pd.to_datetime()" function mentioned above.  However, this isn't enough and will most likely raise an error, since we have dates in multiple formats, especially the weird day/day/year value (the erroneous one), which triggers an error about the month.  

# *******************************************************************************************************************
# Instead we set the "infer_datetime_format=" arg equal to True and set the "errors=" arg equal to "coerce".  
This will infer the format and return a missing value for dates that couldn't be identified and converted, instead of raising a ValueError.  This returns the Birthday column with aligned formats, with the initial ambiguous day/day/year value being set to NaT, which represents missing values in Pandas for datetime objects.  
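The effect of `errors='coerce'` can be seen in a small sketch. The sample values are assumptions (two dates in one format plus the impossible "27/27/19" from the birthdays table), and I left out `infer_datetime_format` on purpose because, as far as I know, newer pandas versions deprecate it.

```python
import pandas as pd

# Hypothetical birthdays in one format, plus the impossible "27/27/19" value
birthdays = pd.DataFrame({"Birthday": ["2019-03-29", "2019-03-24", "27/27/19"]})

# errors="coerce" turns values that cannot be parsed into NaT
birthdays["Birthday"] = pd.to_datetime(birthdays["Birthday"], errors="coerce")

print(birthdays)
print(birthdays["Birthday"].isna().sum(), "value(s) could not be parsed")
```

Counting the NaT values afterwards is a cheap way to see how much data the coercion cost us.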

# We can also convert the format of a datetime column using the ".dt.strftime()" method, 
Which accepts a datetime format of your choice.  For example, here we convert the Birthday column to day-month-year, instead of year-month-day.  However, a common problem is having ambiguous dates with vague formats.  For example, is the date value "2019-03-08" in March or August?  Unfortunately there is no clear-cut way to spot this inconsistency or to treat it.  Depending on the size of the dataset and suspected ambiguities, we can convert these dates to NAs and deal with them accordingly.  Or, if you have additional context on the source of your data, you can probably infer the format.  If the majority of subsequent or previous data is of one format, you can probably infer the format as well.  All in all, it is essential to properly understand where your data comes from before trying to treat it, as it will make these decisions much easier.  
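A minimal sketch of the `.dt.strftime()` conversion, on two made-up dates (an assumption, since the converted birthdays output is not shown in these notes):

```python
import pandas as pd

# Hypothetical, already-parsed birthdays
birthdays = pd.DataFrame({"Birthday": pd.to_datetime(["2019-03-29", "2019-03-24"])})

# .dt.strftime() returns strings, so the column is no longer datetime afterwards
birthdays["Birthday"] = birthdays["Birthday"].dt.strftime("%d-%m-%Y")

print(birthdays["Birthday"].tolist())
```

Because the result is text, it makes sense to do this reformatting last, after any date arithmetic or filtering.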

Now let's make our data uniform.  



temperatures = pd.read_csv('temperature.csv')
temperatures.head()

-------------------------------------------
   Date      | Temperature
   03.03.19  | 14.0
   04.03.19  | 15.0
   05.03.19  | 18.0
   06.03.19  | 16.0
   07.03.19  | 62.6
   
   
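import matplotlib.pyplot as plt

plt.scatter(data=temperatures, x='Date', y='Temperature')
plt.title('Temperature in Celsius, March 2019 - NYC')
plt.xlabel('Dates')
plt.ylabel('Temperature in Celsius')

plt.show()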


temp_fah = temperatures.loc[temperatures['Temperature']>40, 'Temperature']   #++++++++++++++++++++++++++++++++++++++++

temperatures.loc[temperatures['Temperature']>40, 'Temperature'] = (temp_fah - 32) * (5/9)  #++++++++++++++++++++++++++

# Assert conversion is correct
assert temperatures['Temperature'].max() < 40



---------------------------------------------------
   Birthday         | First name   | Last name
   27/27/19         | Rowan        | Nunex
   03-29-19         | Brynn        | Yang
   March 3rd, 2019  | Sophia       | Reilly
   24-03-19         | Deacon       | Prince
   06-03-19         | Griffith     | Neal
   

-----------------------------------------
Date                | datetime format
25-12-2019          | %d-%m-%Y
December 25th 2019  | %c
12-25-2019          | %m-%d-%Y
...                 | ...

pd.to_datetime() can recognize most formats automatically

# Converts to datetime - but won't work because of the erroneous day/day/year value
birthdays['Birthday'] = pd.to_datetime(birthdays['Birthday'])
  ValueError: month must be in 1..12
  
# Will work
birthdays['Birthday'] = pd.to_datetime(birthdays['Birthday'], 
                                       # Attempt to infer format of each date
                                       # (deprecated since pandas 2.0, where format='mixed' covers columns with several formats)
                                       infer_datetime_format=True,   #++++++++++++++++++++++++++++++++++++++++++++++++
                                       # Return NA for rows where conversion failed
                                       errors='coerce')   #+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

## Ambiguous dates

You have a DataFrame containing a "subscription_date" column that was collected from various sources with different Date formats such as YYYY-mm-dd and YYYY-dd-mm. What is the best way to unify the formats for ambiguous values such as 2019-04-07?
Answer the question
50XP
Possible Answers

    Set them to NA and drop them.   # If we wish to take the easy way
    Infer the format of the data in question by checking the format of subsequent and previous values.   # Could be a typo too
    Infer the format from the original data source.   # If it's provided
#    All of the above are possible, as long as we investigate where our data comes from, and understand the dynamics affecting it before cleaning it.
    
Hint

    Ambiguous date formats represent a data cleaning challenge that requires a solid understanding of where your data comes from.
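To make the ambiguity concrete, the sketch below parses the value from the question with both candidate formats. Passing an explicit `format=` is only possible once we know (or have decided) which source a row came from.

```python
import pandas as pd

ambiguous = "2019-04-07"

# Read as year-month-day: 7 April 2019
as_ymd = pd.to_datetime(ambiguous, format="%Y-%m-%d")

# Read as year-day-month: 4 July 2019
as_ydm = pd.to_datetime(ambiguous, format="%Y-%d-%m")

print(as_ymd.date(), as_ydm.date())
```

Both calls succeed without any warning, which is exactly why this kind of error cannot be caught by parsing alone.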


## Uniform currencies

# *******************************************************************************************************************
In this exercise and throughout this chapter, you will be working with a retail banking dataset stored in the banking DataFrame. The dataset contains data on the amount of money stored in accounts (acct_amount), their currency (acct_cur), amount invested (inv_amount), account opening date (account_opened), and last transaction date (last_transaction) that were consolidated from American and European branches.

# You are tasked with understanding the average account size and how investments vary by the size of account. 
However, to produce this analysis accurately, you first need to unify the currency amounts into dollars. The pandas package has been imported as pd, and the banking DataFrame is in your environment.
Instructions
100 XP

#    Find the rows of "acct_cur" in banking that are equal to 'euro' and store them in the variable "acct_eu".
#    Find all the rows of "acct_amount" in banking that fit the "acct_eu" condition, and convert them to USD by multiplying them by 1.1.
    Find all the rows of "acct_cur" in banking that fit the "acct_eu" condition, set them to 'dollar'.


In [6]:
import pandas as pd


banking = pd.read_csv('banking_dirty.csv', index_col=0)
print(banking.sample(9))



# Find values of acct_cur that are equal to 'euro'
acct_eu = banking['acct_cur'] == 'euro'

# Convert acct_amount where it is in euro to dollars
banking.loc[acct_eu, 'acct_amount'] = banking.loc[acct_eu, 'acct_amount'] * 1.1   #++++++++++++++++++++++++++++++++++

# Unify acct_cur column by changing 'euro' values to 'dollar'
banking.loc[acct_eu, 'acct_cur'] = 'dollar'  #+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

# Assert that only dollar currency remains (raises AssertionError, not ValueError, if other currencies are left)
assert (banking['acct_cur'] == 'dollar').all()

     cust_id  birth_date  Age   acct_amount  inv_amount   fund_A   fund_B  \
52  A631984D  1978-02-27   42  7.779933e+04       46410   2188.0  35508.0   
63  0B44C3F8  1975-02-20   45  3.398487e+04       31393  10373.0   5583.0   
6   6B094617  1977-08-26   43  8.985598e+04       34549   1796.0    312.0   
61  45F31C81  1975-01-12   49  1.206753e+08       94608  15416.0  18845.0   
0   870A9281  1962-06-09   58  6.352331e+04       51295  30105.0   4138.0   
40  777A7F2C  1973-08-12   47  5.268417e+04       20968    380.0   2402.0   
15  3C5CBBD7  1971-05-20   49  5.967801e+04       35937   4133.0   8540.0   
94  A731C34E  1961-06-03   59  9.535202e+04       84065  12061.0  15742.0   
18  C9FB0E86  1965-10-04   55  8.868234e+04       26164   5504.0   4063.0   

     fund_C   fund_D account_opened last_transaction  
52    331.0   8383.0       26-01-18         06-10-19  
63   1669.0  13768.0       10-04-18         28-09-19  
6   20610.0  11831.0       06-02-18         14-02-19  
61  20325
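
The three steps of the currency exercise can be tried on a tiny made-up frame. This is only a sketch: `banking_demo` and its values are hypothetical, and the 1.1 rate is the one given in the instructions.

```python
import pandas as pd

# Made-up accounts for illustration
banking_demo = pd.DataFrame({
    'acct_amount': [1000.0, 2000.0, 500.0],
    'acct_cur': ['dollar', 'euro', 'euro'],
})

# Boolean mask computed once, before any value is overwritten
acct_eu = banking_demo['acct_cur'] == 'euro'

# Convert the euro amounts, then relabel the currency
banking_demo.loc[acct_eu, 'acct_amount'] = banking_demo.loc[acct_eu, 'acct_amount'] * 1.1
banking_demo.loc[acct_eu, 'acct_cur'] = 'dollar'

# Check that every row is now in dollars
assert (banking_demo['acct_cur'] == 'dollar').all()
print(banking_demo)
```

The order matters: the mask is built before the labels change, so it can be reused for both assignments. Relabelling first and then recomputing the mask would leave the amounts unconverted.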

## Uniform dates

After having unified the currencies of your different account amounts, you want to add a temporal dimension to your analysis and see how customers have been investing their money given the size of their account over each year. The account_opened column represents when customers opened their accounts and is a good proxy for segmenting customer activity and investment over time.

However, since this data was consolidated from multiple sources, you need to make sure that all dates are of the same format. You will do so by converting this column into a datetime object, while making sure that the format is inferred and potentially incorrect formats are set to missing. The banking DataFrame is in your environment and pandas was imported as pd.
Instructions 1/4
25 XP

    Question 1
    Print the header of "account_opened" from the "banking" DataFrame and take a look at the different results.
    
    
    
    Question 2
    Question:
--------
Take a look at the output. You tried converting the values to datetime using the default
"to_datetime()" function without changing any arguments, but received the following error:
ValueError: month must be in 1..12
Why do you think that is?
Possible Answers
- The to_datetime() function needs to be explicitly told which date format each row is in.[X]
- The to_datetime() function can only be applied on YY-mm-dd date formats.[X]
# - The 21-14-17 entry is erroneous and leads to an error.[Correct]
    
    
    
    Question 3
    Convert the account_opened column to datetime, while making sure the date format is inferred
  and that erroneous formats that raise an error return a missing value.
    
    
    
    Question 4
    Extract the year from the amended "account_opened" column and assign it to the "acct_year" column.
- Print the newly created "acct_year" column


In [None]:
print(banking['account_opened'].unique())

print(banking)




banking['account_opened'] = pd.to_datetime(banking['account_opened'], 
                                           infer_datetime_format=True, 
                                           errors='coerce')


banking['acct_year'] = banking['account_opened'].dt.year
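
The effect of the erroneous 21-14-17 entry from Question 2 can be reproduced on a short made-up Series. This sketch assumes an explicit day-month-year format rather than inference. Without `errors='coerce'` the conversion raises a `ValueError` (the exact message depends on the pandas version); with it, the bad entry becomes `NaT`.

```python
import pandas as pd

# Made-up values in day-month-year style, plus the erroneous entry
opened = pd.Series(['26-01-18', '10-04-18', '21-14-17'])

# Default behaviour: the impossible month makes the whole conversion fail
try:
    pd.to_datetime(opened, format='%d-%m-%y')
    raised = False
except ValueError as err:
    raised = True
    print("Conversion failed:", err)

# With errors='coerce' the bad entry is set to NaT and the rest is kept
coerced = pd.to_datetime(opened, format='%d-%m-%y', errors='coerce')
print(coerced)
```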

In [13]:

banking = pd.read_csv('banking_dirty.csv', index_col=0)
print(banking.sample(9))


banking['account_opened'] = pd.to_datetime(banking['account_opened'], 
                                           infer_datetime_format=True, 
                                           errors='coerce')

banking['year'] = banking['account_opened'].dt.year  #+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
print(banking.head())

print(banking['account_opened'].dtypes)
print(banking['year'].dtypes)

     cust_id  birth_date  Age  acct_amount  inv_amount   fund_A   fund_B  \
12  EEBD980F  1990-11-20   34     57838.49       50812  18314.0   1477.0   
41  2EC1B555  1974-03-17   46     55976.78       51477   8303.0  24112.0   
78  F7FC8F78  1974-05-07   46     88049.82       84430   4590.0  24786.0   
74  904A19DD  1987-02-18   33     31981.36       13188   9599.0    858.0   
33  B5D367B5  1981-02-12   39     44226.86       36571   1280.0   8191.0   
64  5321D380  1982-04-30   38     59700.08        8143    117.0   1198.0   
60  5AEA5AB8  1972-10-24   48    100266.99       89341     41.0  13870.0   
2   BFC13E88  1990-09-12   34     59863.77       24567  10323.0   4590.0   
35  078C654F  1993-10-17   27     87312.64       66529   3684.0  17635.0   

      fund_C   fund_D account_opened last_transaction  
12  29049.48   5539.0       08-12-18         04-01-20  
41  15776.00   3286.0       05-12-17         21-10-19  
78   3346.00  51708.0       28-02-18         30-04-18  
74   1083.00   

In [None]:
birthdays['Birthday'] = pd.to_datetime(birthdays['Birthday'], 
                                       # Attempt to infer format of each date (argument deprecated since pandas 2.0)
                                       infer_datetime_format=True,   #++++++++++++++++++++++++++++++++++++++++++++++++
                                       # Return NA for rows where conversion failed
                                       errors='coerce')   #+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


# Alternative: extract the year as a string with strftime (.dt.year returns a number)
banking['acct_year'] = banking['account_opened'].dt.strftime('%Y')
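
The two ways of extracting the year used in these notes do not return the same type. A minimal sketch on made-up dates:

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(['2018-01-26', '2017-04-10']))

year_num = dates.dt.year            # integer dtype
year_str = dates.dt.strftime('%Y')  # strings such as '2018'

print(year_num.dtype, year_str.dtype)
print(year_num.iloc[0] + 1)   # arithmetic works on the numeric version
print(year_str.iloc[0] + '!') # the string version concatenates instead
```

Which one to prefer depends on what follows: numeric years suit arithmetic and range filters, while string years are sometimes convenient as labels.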

## Cross field validation






Here is the second lesson of this chapter.  In this lesson, we'll talk about cross field validation for diagnosing dirty data.  Let's take a look at the following dataset.  

It contains flight statistics on the total number of passengers in economy, business and first class as well as the total passengers for each flight.  
# *******************************************************************************************************************
We know that these columns have been collected and merged from different data sources, and a common challenge when merging data from different sources is data integrity, or more broadly making sure that our data is correct.  This is where "cross field validation" comes in.  

# *******************************************************************************************************************
"Cross Field Validation" is the use of multiple fields in your dataset to sanity check the integrity of your data.  For example, in our flights dataset, this could mean summing the economy, business and first class values and making sure they equal the total passengers on the plane.  This is easily done in Pandas: first subset the columns to sum, then use the ".sum()" method with the axis argument set to 1 to sum row-wise.  We then find the rows where the total passengers column equals the sum of the classes, and filter out the rows with inconsistent passenger amounts by subsetting with brackets and the tilde symbol on that equality.  
# *******************************************************************************************************************

Here is another example containing user IDs, birthdays and age values for a set of users.  We can, for example, check that the age and birthday columns agree by computing the number of years between today's date and each birthday.  We first make sure the Birthday column is converted to "datetime64[ns]" with the Pandas "pd.to_datetime()" function.  We then create an object storing today's date using the datetime package's "date.today()" function.  Next, we calculate the difference in years between today's year and the year of each birthday using the ".dt.year" attribute of the users' Birthday column (a single datetime object has a ".year" attribute).  We then find the rows where the calculated ages equal the age column in the users DataFrame, and filter out the inconsistent rows by subsetting with brackets and the tilde symbol on that equality.  


# So what should be the course of action in case we spot inconsistencies with "Cross Field Validation"?  
Just like other data cleaning problems, there is no one size fits all solution, as often the best solution requires an in depth understanding of our datasets.  

We can decide to either drop inconsistent data, set it to missing and impute it, or apply some rules based on domain knowledge.  All these routes and assumptions can be decided upon only when you have a good understanding of where your dataset comes from and the different sources feeding into it.  
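
The three routes can be sketched on a two-row made-up frame. Everything here is illustrative: `flights_demo` is hypothetical, and the "domain rule" simply assumes the per-class counts are more trustworthy than the reported total.

```python
import pandas as pd

flights_demo = pd.DataFrame({
    'economy_class': [100, 130],
    'business_class': [60, 100],
    'first_class': [40, 20],
    'total_passengers': [200, 260],   # second total is deliberately wrong
})
class_cols = ['economy_class', 'business_class', 'first_class']
consistent = flights_demo[class_cols].sum(axis=1) == flights_demo['total_passengers']

# Route 1: drop the inconsistent rows
dropped = flights_demo[consistent]

# Route 2: set the suspect value to missing, ready for imputation later
as_missing = flights_demo.copy()
as_missing['total_passengers'] = flights_demo['total_passengers'].where(consistent)

# Route 3: apply a rule, here recomputing the total from the class counts
recomputed = flights_demo.copy()
recomputed.loc[~consistent, 'total_passengers'] = (
    recomputed.loc[~consistent, class_cols].sum(axis=1)
)

print(dropped, as_missing, recomputed, sep='\n\n')
```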


Now that you know "Cross Field Validation", let's practice.  



-------------------------------------------------------------------------------------
   flight_number | economy_class | business_class | first_class | total_passengers
           DL140 |           100 |             60 |          40 |              200
           BA248 |           130 |            100 |          70 |              300
          MEA124 |           100 |             50 |          50 |              200
          AFR939 |           140 |             70 |          90 |              300
           DL140 |           130 |            100 |          20 |              250

# Sum the three class columns row-wise and compare against the reported total
sum_classes = flights[['economy_class', 'business_class', 'first_class']].sum(axis=1)
passenger_equ = sum_classes == flights['total_passengers']

# Find and filter out rows with inconsistent passenger totals
inconsistent_pass = flights[~passenger_equ]
consistent_pass = flights[passenger_equ]
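
For a runnable version of the flights check, the sketch below rebuilds the table as a DataFrame. In the table as shown every row already adds up, so the last total is changed here (a hypothetical edit, from 250 to 260) to produce one inconsistent row.

```python
import pandas as pd

# Rows from the table above; the last total is deliberately altered
flights = pd.DataFrame({
    'flight_number': ['DL140', 'BA248', 'MEA124', 'AFR939', 'DL140'],
    'economy_class': [100, 130, 100, 140, 130],
    'business_class': [60, 100, 50, 70, 100],
    'first_class': [40, 70, 50, 90, 20],
    'total_passengers': [200, 300, 200, 300, 260],
})

# Row-wise sum of the class columns compared against the reported total
sum_classes = flights[['economy_class', 'business_class', 'first_class']].sum(axis=1)
passenger_equ = sum_classes == flights['total_passengers']

inconsistent_pass = flights[~passenger_equ]
consistent_pass = flights[passenger_equ]
print(inconsistent_pass)
```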


import datetime
# Convert to datetime and get today's date
users['Birthday'] = pd.to_datetime(users['Birthday'])  # Assumes all observations share one format (no mixed sources)

today = datetime.date.today()

# For each row in the Birthday column, calculate year difference
age_manual = today.year - users['Birthday'].dt.year

# Find instances where ages match
age_equ = age_manual == users['age']

# Find and filter out rows with inconsistent age
inconsistent_age = users[~age_equ]
consistent_age = users[age_equ]
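
One caveat with the plain year difference: it overstates the age by one for anyone whose birthday has not yet occurred in the current year. The sketch below is one possible refinement, not the course's method. It uses made-up users and a fixed reference date in place of `date.today()` so the result is reproducible.

```python
import datetime
import pandas as pd

# Made-up users; a fixed reference date stands in for date.today()
users_demo = pd.DataFrame({
    'Birthday': pd.to_datetime(['1990-03-15', '1990-11-20']),
    'age': [30, 29],
})
today = datetime.date(2020, 6, 1)

# Plain year difference, as in the notes
age_by_year = today.year - users_demo['Birthday'].dt.year

# True where the birthday has already happened in the reference year
had_birthday = (
    (users_demo['Birthday'].dt.month < today.month)
    | ((users_demo['Birthday'].dt.month == today.month)
       & (users_demo['Birthday'].dt.day <= today.day))
)

# Subtract one year for users who have not had their birthday yet
age_exact = age_by_year - (~had_birthday).astype(int)

print(age_by_year.tolist(), age_exact.tolist())
```

With the plain difference the second user would be flagged as inconsistent even though the recorded age is right, so it is worth deciding which definition of age the dataset uses before filtering rows out.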

## Cross field or no cross field?

Throughout this course, you've been immersed in a variety of data cleaning problems from range constraints, data type constraints, uniformity and more.

In this lesson, you were introduced to cross field validation as a means to sanity check your data and make sure you have strong data integrity.

Now, you will map different applicable concepts and techniques to their respective categories.
Instructions
100XP

    Map different applicable concepts and techniques to their respective categories.


Cross Field Validation:
    Confirming the Age provided by users by cross checking their birthdays
    Row wise operations such as ".sum(axis=1)"
    
Not Cross Field Validation:
    Making sure that a "revenue" column is a numeric column
    Making sure a "subscription_date" column has no values set in the future
    The use of the ".astype()" method.
    

## How's our data integrity?

New data has been merged into the banking DataFrame that contains details on how investments in the inv_amount column are allocated across four different funds A, B, C and D.

Furthermore, the age and birthdays of customers are now stored in the age and birth_date columns respectively.

You want to understand how customers of different age groups invest. However, you want to first make sure the data you're analyzing is correct. You will do so by cross field checking values of inv_amount and age against the amount invested in different funds and customers' birthdays. Both pandas and datetime have been imported as pd and dt respectively.
Instructions 1/2
50 XP

    Question 1
        Find the rows where the row-wise sum of the fund_columns in banking is equal to the inv_amount column.
        Store the values of banking with consistent inv_amount in consistent_inv, and those with inconsistent ones in inconsistent_inv.

    Question 2

        Store today's date into today, and manually calculate customers' ages and store them in ages_manual.
        Find all rows of banking where the age column is equal to ages_manual and then filter banking into consistent_ages and inconsistent_ages.


In [None]:
# Store fund columns to sum against
fund_columns = ['fund_A', 'fund_B', 'fund_C', 'fund_D']

# Find rows where fund_columns row sum == inv_amount
inv_equ = banking[____].____(____) == ____

# Store consistent and inconsistent data
consistent_inv = ____[____]
inconsistent_inv = ____[____]

# Print the number of inconsistent investments
print("Number of inconsistent investments: ", inconsistent_inv.shape[0])