<a href="https://colab.research.google.com/github/nirmalaraj77/Python/blob/main/Cleaning_Data_in_Python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Cleaning Data in Python**

* .info()
* .describe()
* assert
* index_col = 'Unnamed: 0' - read the unnamed first CSV column as the index instead of creating a new column
* ~ - invert a boolean mask when subsetting




In [14]:
# Import practice datasets
import pandas as pd

airlines = pd.read_csv('https://raw.githubusercontent.com/nirmalaraj77/datasets/refs/heads/main/airlines_final.csv', index_col = 'Unnamed: 0')
banking = pd.read_csv('https://raw.githubusercontent.com/nirmalaraj77/datasets/refs/heads/main/banking_dirty.csv', index_col = 'Unnamed: 0')
ride_sharing = pd.read_csv('https://raw.githubusercontent.com/nirmalaraj77/datasets/refs/heads/main/ride_sharing_new.csv', index_col = 'Unnamed: 0')
restaurants = pd.read_csv('https://raw.githubusercontent.com/nirmalaraj77/datasets/refs/heads/main/restaurants_L2.csv', index_col = 'Unnamed: 0')
restaurants_new = pd.read_csv('https://raw.githubusercontent.com/nirmalaraj77/datasets/refs/heads/main/restaurants_L2_dirty.csv', index_col = 'Unnamed: 0')

## **String to Integers**

### Remove '£' from Revenue column
* Sales['Revenue'] = Sales['Revenue'].str.strip('£')

### Convert Revenue column to integer
* Sales['Revenue'] = Sales['Revenue'].astype('int')

### Verify datatype
* assert Sales['Revenue'].dtype == 'int'
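
The three steps above can be sketched end to end as follows. The `sales` DataFrame and its values are made up for illustration, since the real Sales data is not shown in these notes, and the dtype check uses `pd.api.types.is_integer_dtype` so it does not depend on the platform's default integer width.

```python
import pandas as pd

# Hypothetical sales data (assumption: Revenue is stored as strings with a '£' prefix)
sales = pd.DataFrame({'Revenue': ['£100', '£2500', '£42']})

# Remove the currency symbol, then convert the remaining digits to integers
sales['Revenue'] = sales['Revenue'].str.strip('£')
sales['Revenue'] = sales['Revenue'].astype('int')

# Verify the conversion worked
assert pd.api.types.is_integer_dtype(sales['Revenue'])
print(sales['Revenue'].sum())
```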




## **Numeric to Categorical**

### Convert to category
* df['col_name'] = df['col_name'].astype('category')



In [None]:
# Print the information of ride_sharing
print(ride_sharing.info())

# Print summary statistics of user_type column
print(ride_sharing['user_type'].describe())

# Convert user_type from integer to category
ride_sharing['user_type_cat'] = ride_sharing['user_type'].astype('category')

# Write an assert statement confirming the change
assert ride_sharing['user_type_cat'].dtype == 'category'

# Print new summary statistics
print(ride_sharing['user_type_cat'].describe())

<class 'pandas.core.frame.DataFrame'>
Index: 25760 entries, 0 to 25759
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   duration         25760 non-null  object
 1   station_A_id     25760 non-null  int64 
 2   station_A_name   25760 non-null  object
 3   station_B_id     25760 non-null  int64 
 4   station_B_name   25760 non-null  object
 5   bike_id          25760 non-null  int64 
 6   user_type        25760 non-null  int64 
 7   user_birth_year  25760 non-null  int64 
 8   user_gender      25760 non-null  object
 9   duration_trim    25760 non-null  object
 10  duration_time    25760 non-null  int64 
dtypes: int64(6), object(5)
memory usage: 2.4+ MB
None
count    25760.000000
mean         2.008385
std          0.704541
min          1.000000
25%          2.000000
50%          2.000000
75%          3.000000
max          3.000000
Name: user_type, dtype: float64
count     25760
unique        3
top           2


In [None]:
# Strip 'minutes' from duration (.str.strip removes any of the listed characters from both ends)
ride_sharing['duration_trim'] = ride_sharing['duration'].str.strip('minutes')

# Convert duration to integer
ride_sharing['duration_time'] = ride_sharing['duration_trim'].astype('int')

# Write an assert statement making sure of conversion
assert ride_sharing['duration_time'].dtype == 'int'

# Print formed columns and calculate average ride duration
print(ride_sharing[['duration','duration_trim','duration_time']])
print(ride_sharing['duration_time'].mean())

         duration duration_trim  duration_time
0      12 minutes           12              12
1      24 minutes           24              24
2       8 minutes            8               8
3       4 minutes            4               4
4      11 minutes           11              11
...           ...           ...            ...
25755  11 minutes           11              11
25756  10 minutes           10              10
25757  14 minutes           14              14
25758  14 minutes           14              14
25759  29 minutes           29              29

[25760 rows x 3 columns]
11.389052795031056


## **Out of range dates**

* import datetime as dt
* today_date = dt.date.today()
* find dates in the future
* df['subs_date'] > today_date

In [None]:
import datetime as dt
today_date = dt.date.today()
today_date
birthday = dt.date(1999, 10, 2)
today_date < birthday

False

## **Deal with out of range data**

* drop
* setting custom minimum and maximums
* treat as missing and impute
* setting custom value depending on business assumption

### Drop values using filtering
* movies = movies[movies['rating'] <= 5]

### Drop values using .drop
* movies.drop(movies[movies['rating'] > 5].index, inplace = True)

### Set custom maximum value
* movies.loc[movies['rating'] > 5, 'rating'] = 5
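
A minimal sketch comparing the three approaches to out-of-range values, assuming a hypothetical `movies` DataFrame with ratings on a 1-5 scale. The titles and ratings are invented for illustration.

```python
import pandas as pd

# Hypothetical ratings on a 1-5 scale with two out-of-range values
movies = pd.DataFrame({'title': ['A', 'B', 'C', 'D'],
                       'rating': [4, 6, 5, 9]})

# Option 1: keep only rows within range (boolean filtering)
filtered = movies[movies['rating'] <= 5]

# Option 2: drop the offending rows by their index labels
dropped = movies.drop(movies[movies['rating'] > 5].index)

# Option 3: cap values at the maximum instead of dropping rows
capped = movies.copy()
capped.loc[capped['rating'] > 5, 'rating'] = 5

# Filtering and dropping give the same result; capping keeps every row
assert filtered.equals(dropped)
assert capped['rating'].max() == 5
```

Filtering and `.drop` lose rows, while capping keeps them, so the choice depends on whether the rest of each row is still trustworthy.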

### Convert to date
* import datetime as dt
* movies['signup'] = pd.to_datetime(movies['signup']).dt.date


### Drop future dates using filtering or .drop

### Hardcode dates with upper limit
* today_date = dt.date.today()
* movies.loc[movies['signup'] > today_date, 'signup'] = today_date
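
The date conversion and the upper-limit hardcoding can be sketched together as below. The `movies` data and the `future` helper date are assumptions made for the example.

```python
import datetime as dt
import pandas as pd

today_date = dt.date.today()

# Hypothetical sign-up dates: one in the past, one a year in the future
future = today_date + dt.timedelta(days=365)
movies = pd.DataFrame({'signup': ['2019-03-01', future.isoformat()]})

# Parse the strings to datetimes, then keep only the date part
movies['signup'] = pd.to_datetime(movies['signup']).dt.date

# Cap any future date at today
movies.loc[movies['signup'] > today_date, 'signup'] = today_date

# No sign-up date should lie in the future now
assert all(d <= today_date for d in movies['signup'])
```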








## **Uniqueness Constraints**

* **Subsetting on metadata and keeping all duplicate records gives you a better bird's-eye view of your data and how to de-duplicate it**

### Find Duplicates

* .duplicated()
* True for duplicated and False for non-duplicated
* subset: list of column names to check for duplication
* keep: 'first' or 'last' leaves that occurrence unmarked; False (the boolean, not the string 'False') marks every duplicate
* .sort_values(by = 'col_name')
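
As a small illustration of how `subset` and `keep` change the result, here is a sketch on a made-up `rides` DataFrame. The column names mirror the ride-sharing example further down, but the values are invented.

```python
import pandas as pd

# Hypothetical rides: ride_id 1 is a complete duplicate, ride_id 2 a partial one
rides = pd.DataFrame({'ride_id': [1, 1, 2, 2, 3],
                      'duration': [10, 10, 7, 9, 12]})

# keep=False flags every member of a duplicate group, not just the repeats
duplicates = rides.duplicated(subset='ride_id', keep=False)

# Sort so duplicate rows sit next to each other for inspection
duplicated_rides = rides[duplicates].sort_values(by='ride_id')
print(duplicated_rides)

# With the defaults (all columns, keep='first') only the exact repeat is flagged
print(rides.duplicated())
```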

### Drop full duplicates

* .drop_duplicates()
* subset: list of column names to check for duplication
* keep: 'first' or 'last' keeps that occurrence; False (boolean) drops every duplicated row
* inplace: True (boolean) drops duplicates directly in df

### Drop partial duplicates

* Group by column names and produce statistical summaries
* .groupby() and .agg()
* .reset_index() - numbered indices in final output
* column_names = ['first_name', 'last_name', 'address']
* summaries = {'height' : 'max', 'weight' : 'mean'}
* height_weight = height_weight.groupby(by = column_names).agg(summaries).reset_index()
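
Dropping full duplicates and then collapsing partial ones might look like this in practice. The `height_weight` records are invented for the sketch, and choosing 'max' for height and 'mean' for weight is only one possible business rule.

```python
import pandas as pd

# Hypothetical records: rows 0-1 are complete duplicates,
# rows 2-3 share identifying columns but differ in measurements
height_weight = pd.DataFrame({
    'first_name': ['Ann', 'Ann', 'Bob', 'Bob'],
    'last_name': ['Lee', 'Lee', 'Ray', 'Ray'],
    'address': ['1 Elm St', '1 Elm St', '9 Oak Ave', '9 Oak Ave'],
    'height': [160, 160, 180, 182],
    'weight': [55.0, 55.0, 80.0, 82.0],
})

# Step 1: remove rows that are identical in every column
height_weight = height_weight.drop_duplicates()

# Step 2: collapse the remaining partial duplicates with summary statistics
column_names = ['first_name', 'last_name', 'address']
summaries = {'height': 'max', 'weight': 'mean'}
height_weight = height_weight.groupby(by=column_names).agg(summaries).reset_index()

# No identifying combination should appear more than once now
assert not height_weight.duplicated(subset=column_names, keep=False).any()
print(height_weight)
```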

### E.g.
* Find duplicates
* duplicates = ride_sharing.duplicated(subset = 'ride_id', keep = False)

* Sort your duplicated rides
* duplicated_rides = ride_sharing[duplicates].sort_values('ride_id')

* Print relevant columns
* print(duplicated_rides[['ride_id','duration','user_birth_year']])

### E.g.
* Drop complete duplicates from ride_sharing
* ride_dup = ride_sharing.drop_duplicates()

* Create statistics dictionary for aggregation function
* statistics = {'user_birth_year': 'min', 'duration': 'mean'}

* Group by ride_id and compute new statistics
* ride_unique = ride_dup.groupby('ride_id').agg(statistics).reset_index()

* Find duplicated values again
* duplicates = ride_unique.duplicated(subset = 'ride_id', keep = False)
* duplicated_rides = ride_unique[duplicates == True]

* Assert duplicates are processed
* assert duplicated_rides.shape[0] == 0






## **Categories and Membership Constraints**

### Find inconsistent categories

* .set and .difference
* create list of categories
* find categories in df not in list
* find rows in df matching inconsistent categories
* .isin
* subset df with boolean result

### Drop inconsistent categories and get consistent categories only

* ~ subset





In [22]:
# use airlines dataset
airlines.head()

# create dictionary of categories
cat_all = {'cleanliness' : ['Clean', 'Average', 'Somewhat clean', 'Somewhat dirty', 'Dirty'],
           'safety' : ['Neutral', 'Very safe', 'Somewhat safe', 'Very unsafe', 'Somewhat unsafe'],
           'satisfaction' : ['Very satisfied', 'Neutral', 'Somewhat satisfied', 'Somewhat unsatisfied', 'Very unsatisfied']}

# create categories df
categories = pd.DataFrame (cat_all)

# Print unique values of survey columns in airlines
print('Cleanliness: ', airlines['cleanliness'].unique(), "\n")
print('Safety: ', airlines['safety'].unique(), "\n")
print('Satisfaction: ', airlines['satisfaction'].unique(), "\n")

# Find the cleanliness category in airlines not in categories
cat_clean = set(airlines['cleanliness']).difference(categories['cleanliness'])

# Find rows with that category
cat_clean_rows = airlines['cleanliness'].isin(cat_clean)

# Print rows with inconsistent category
print(airlines[cat_clean_rows])

# Print rows with consistent categories only
# print(airlines[~cat_clean_rows])

Cleanliness:  ['Clean' 'Average' 'Unacceptable' 'Somewhat clean' 'Somewhat dirty'
 'Dirty'] 

Safety:  ['Neutral' 'Very safe' 'Somewhat safe' 'Very unsafe' 'Somewhat unsafe'] 

Satisfaction:  ['Very satisfied' 'Neutral' 'Somewhat satsified' 'Somewhat unsatisfied'
 'Very unsatisfied'] 

       id        day           airline  destination  dest_region dest_size  \
4    2992  Wednesday          AMERICAN        MIAMI      East US       Hub   
18   2913     Friday  TURKISH AIRLINES     ISTANBUL  Middle East       Hub   
100  2321  Wednesday         SOUTHWEST  LOS ANGELES      West US       Hub   

    boarding_area dept_time  wait_min   cleanliness         safety  \
4     Gates 50-59  31-12-18       559  Unacceptable      Very safe   
18   Gates 91-102  31-12-18       225  Unacceptable      Very safe   
100   Gates 20-39  31-12-18       130  Unacceptable  Somewhat safe   

           satisfaction  
4    Somewhat satsified  
18   Somewhat satsified  
100  Somewhat satsified  


## **Categories and Value Constraints**

### Capitalize or Lowercase
* df['category'] = df['category'].str.upper()
* df['category'] = df['category'].str.lower()
* df.groupby('category').count()
* df['category'].value_counts()

### Leading or Trailing Spaces
* df['category'] = df['category'].str.strip()
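
The case and whitespace fixes are often applied together. A rough sketch on a made-up `category` column (the marital-status values are an assumption for illustration):

```python
import pandas as pd

# Hypothetical column with mixed case and stray spaces
df = pd.DataFrame({'category': ['Married', ' married', 'MARRIED ', 'single', 'Single ']})

# Before cleaning, every variant counts as its own category
print(df['category'].value_counts())

# Strip whitespace first, then normalize case
df['category'] = df['category'].str.strip().str.lower()

# After cleaning, only the two real categories remain
counts = df['category'].value_counts()
print(counts)
```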

### Collapsing Data into Categories
* create an income_group column of categories from the household_income column

1. using qcut from pandas
* group_names = ['0-200K', '200K-500K', '500K+']
* demographics['income_group'] = pd.qcut(demographics['household_income'], q = 3, labels = group_names)

2. using cut from pandas - create category ranges and names
* ranges = [0, 200000, 500000, np.inf]
* group_names = ['0-200K', '200K-500K', '500K+']
* demographics['income_group'] = pd.cut(demographics['household_income'], bins = ranges, labels = group_names)
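
To make the difference between the two functions concrete, here is a sketch on invented incomes. `qcut` picks bin edges from the data so each bin holds roughly the same number of rows, which means the labels are only names and may not match the real ranges. `cut` uses the edges you supply. The `income_quantile` column name is an assumption added to keep both results side by side.

```python
import numpy as np
import pandas as pd

# Hypothetical household incomes
demographics = pd.DataFrame(
    {'household_income': [50_000, 150_000, 250_000, 450_000, 600_000, 900_000]})
group_names = ['0-200K', '200K-500K', '500K+']

# qcut: three bins with roughly equal row counts; edges are derived from the data
demographics['income_quantile'] = pd.qcut(
    demographics['household_income'], q=3, labels=group_names)

# cut: bins with explicit edges, so the labels describe the actual ranges
ranges = [0, 200_000, 500_000, np.inf]
demographics['income_group'] = pd.cut(
    demographics['household_income'], bins=ranges, labels=group_names)

print(demographics)
```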


### Collapsing data into categories
* map categories to fewer ones

1. create mapping dictionary and replace
* mapping = {'Microsoft': 'DesktopOS', 'MacOS' : 'DesktopOS', 'IOS' : 'MobileOS', 'Android' : 'MobileOS'}

* devices['operating_system'] = devices['operating_system'].replace(mapping)
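
A short sketch of the mapping approach, using an invented `devices` DataFrame. The extra 'Linux' row is an assumption added to show that `.replace` leaves values missing from the mapping unchanged.

```python
import pandas as pd

# Hypothetical device data
devices = pd.DataFrame(
    {'operating_system': ['Microsoft', 'MacOS', 'IOS', 'Android', 'Linux']})

mapping = {'Microsoft': 'DesktopOS', 'MacOS': 'DesktopOS',
           'IOS': 'MobileOS', 'Android': 'MobileOS'}

# Values not present in the mapping ('Linux') pass through unchanged
devices['operating_system'] = devices['operating_system'].replace(mapping)

print(devices['operating_system'].unique())
```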









