# Importing and Cleaning Data

### Shortcuts
    RUN
        Ctrl-Enter: RUN selected cells
        Alt + Enter: RUN the current cell, then INSERT a new cell below.
        Shift + Enter: RUN the current cell, then SELECT the next one (a new cell is created if none exists).
    
    EDIT
        Esc: ENTER command mode
        A:  New Cell Above
        B:  New Cell Below
        M:  CHANGE to Markdown cell
        Y:  CHANGE to Code cell
        DD: DELETE the current cell
    
    Enter: Edit Mode
    
    Comment
        Ctrl + /  : Comment/Uncomment
    
    Shift + Tab : Show documentation/help for the keyword at the cursor
    Ctrl + Shift + - : SPLIT the current cell in two at the cursor position
    Shift + M : MERGE the selected cells
    Shift + J / Shift + Down : extend the selection to the next cell below
    Shift + K / Shift + Up : extend the selection to the next cell above

In [12]:
import numpy as np
import pandas as pd
cd = pd.read_csv('Toyota.csv',index_col=0,na_values=["??","????"])
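
The effect of `na_values` can be checked on a small in-memory file. The sketch below uses a made-up three-row sample (not the real `Toyota.csv`) in which `??` and `????` stand in for missing readings; the column names are borrowed from the dataset, the values are invented for illustration.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for Toyota.csv
csv_text = """,Price,KM,HP
0,13500,46986,90
1,13750,??,90
2,13950,41711,????
"""

# index_col=0 uses the first column as the row index;
# na_values lists the extra strings that should be read as NaN
sample = pd.read_csv(io.StringIO(csv_text), index_col=0, na_values=["??", "????"])

missing_per_column = sample.isnull().sum()
print(sample.dtypes)
print(missing_per_column)
```

Without `na_values`, the placeholder strings would force `KM` and `HP` to be read as text columns instead of numeric columns with missing values.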

In [13]:
cdt = cd.copy(deep=True)

In [14]:
# Concise Summary of Dataframe
cdt.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1436 entries, 0 to 1435
Data columns (total 10 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Price      1436 non-null   int64  
 1   Age        1336 non-null   float64
 2   KM         1421 non-null   float64
 3   FuelType   1336 non-null   object 
 4   HP         1430 non-null   float64
 5   MetColor   1286 non-null   float64
 6   Automatic  1436 non-null   int64  
 7   CC         1436 non-null   int64  
 8   Doors      1436 non-null   object 
 9   Weight     1436 non-null   int64  
dtypes: float64(4), int64(4), object(2)
memory usage: 123.4+ KB


## Converting Data Types
ASTYPE

In [16]:
# ASTYPE() : Convert Data Types
# Converting MetColor and Automatic to OBJECT Data Type
cdt['MetColor'] = cdt['MetColor'].astype('object')
cdt['Automatic'] = cdt['Automatic'].astype('object')

In [31]:
cdt.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1436 entries, 0 to 1435
Data columns (total 10 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Price      1436 non-null   int64  
 1   Age        1336 non-null   float64
 2   KM         1421 non-null   float64
 3   FuelType   1336 non-null   object 
 4   HP         1430 non-null   float64
 5   MetColor   1286 non-null   object 
 6   Automatic  1436 non-null   object 
 7   CC         1436 non-null   int64  
 8   Doors      1436 non-null   object 
 9   Weight     1436 non-null   int64  
dtypes: float64(3), int64(3), object(4)
memory usage: 123.4+ KB
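
As a possible shorthand, `astype` also accepts a dictionary that maps column names to target types, so several columns can be converted in one call. The sketch below runs on a small made-up frame, not the Toyota data.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with the same column names as the dataset
df = pd.DataFrame({
    "MetColor": [1.0, 0.0, np.nan],
    "Automatic": [0, 1, 0],
    "CC": [2000, 1600, 1400],
})

# Convert both indicator columns in a single call; other columns are untouched
converted = df.astype({"MetColor": "object", "Automatic": "object"})
print(converted.dtypes)
```

`astype` returns a new DataFrame, so the result has to be assigned to a name to be kept.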


In [25]:
np.unique(cdt['FuelType'].astype('str'))

array(['CNG', 'Diesel', 'Petrol', 'nan'], dtype=object)
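
In the output above the missing entries show up as the string `'nan'`, because the column was cast to `str` before listing its values. If only the real categories are wanted, one option is to drop the missing values first. A minimal sketch on a made-up Series:

```python
import numpy as np
import pandas as pd

# Hypothetical sample of fuel types with one missing entry
fuel = pd.Series(["Petrol", "Diesel", np.nan, "CNG", "Petrol"])

# dropna() removes the missing entry, unique() lists the remaining distinct values
fuel_types = sorted(fuel.dropna().unique())
n_missing = fuel.isnull().sum()

print(fuel_types)
print(n_missing)
```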

## Memory Consumed
nbytes : attribute (not a method) giving the total bytes consumed by a column's data
Memory consumption can be reduced by using the categorical data type

In [27]:
cdt['FuelType'].nbytes

11488

In [30]:
cdt['FuelType'].astype('category').nbytes

1460
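
The saving comes from how the values are stored: an object column keeps one 8-byte pointer per row (1436 × 8 = 11488 above), while a categorical column keeps one small integer code per row plus a short table of categories. The sketch below reproduces the comparison on a made-up column; exact byte counts may vary by platform, so only the relative size is compared.

```python
import pandas as pd

# Hypothetical column: 1000 rows drawn from three distinct fuel types
fuel = pd.Series(["Petrol", "Diesel", "CNG", "Petrol"] * 250, dtype="object")
as_category = fuel.astype("category")

print(fuel.nbytes)         # one pointer per row
print(as_category.nbytes)  # one integer code per row + the category table
print(as_category.cat.categories.tolist())
```

The fewer distinct values a column has relative to its length, the larger the saving tends to be.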

## Cleaning "Doors" Column
replace()

numpy.where()

In [33]:
np.unique(cdt['Doors'])

array(['2', '3', '4', '5', 'five', 'four', 'three'], dtype=object)

In [37]:
# Map the spelled-out values to digits in one call and assign the result back
# (inplace=True on a selected column is not guaranteed to update the DataFrame)
cdt['Doors'] = cdt['Doors'].replace({'three': 3, 'four': 4, 'five': 5})

In [44]:
np.unique(cdt['Doors'].astype('int'))

array([2, 3, 4, 5])

In [45]:
cdt['Doors'] = cdt['Doors'].astype('int64')

In [46]:
np.unique(cdt['Doors'])

array([2, 3, 4, 5], dtype=int64)
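
The section heading lists both `replace()` and `numpy.where()`, but only `replace()` is used above. The sketch below shows the same clean-up done both ways on a small made-up Series, so the two results can be compared; it is an illustration, not the notebook's original code.

```python
import numpy as np
import pandas as pd

# Hypothetical sample mixing digit strings and spelled-out numbers
doors = pd.Series(["3", "three", "4", "five", "2", "four"], dtype="object")

# Option 1: replace() with a dictionary, then convert to integers
word_to_digit = {"three": "3", "four": "4", "five": "5"}
cleaned = doors.replace(word_to_digit).astype("int64")

# Option 2: nested numpy.where(), one condition per spelled-out value
fixed = np.where(doors == "three", "3",
         np.where(doors == "four", "4",
          np.where(doors == "five", "5", doors)))
via_where = pd.Series(fixed, index=doors.index).astype("int64")

print(cleaned.tolist())
print(via_where.tolist())
```

With more than two or three values to map, the dictionary form is usually easier to read than nested `np.where` calls.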

## Missing Values Detection
isnull().sum() : number of missing values

In [50]:
# cdt.isnull().sum()
cdt['KM'].isnull().sum()

15
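
The same idea works on the whole DataFrame and can also be used to pull out the affected rows. A minimal sketch on a made-up frame:

```python
import numpy as np
import pandas as pd

# Hypothetical sample with a few missing readings
df = pd.DataFrame({
    "Price": [13500, 13750, 13950, 14950],
    "KM": [46986.0, np.nan, 41711.0, np.nan],
    "HP": [90.0, 90.0, np.nan, 110.0],
})

# Missing values per column
per_column = df.isnull().sum()

# Rows that contain at least one missing value
rows_with_missing = df[df.isnull().any(axis=1)]

print(per_column)
print(rows_with_missing)
```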

# Control Flow, Functions, etc.
IF-ELIF-ELSE, FOR, WHILE

In [51]:
# Create 3 bins for the Car Price
# Create a New Column as Price_Class
# insert(col_position, col_name,Values)
cdt.insert(10,"Price_Class","")

In [57]:
# Chained indexing such as cdt['Price_Class'][i] = ... raises SettingWithCopyWarning
# ("A value is trying to be set on a copy of a slice from a DataFrame"), see
# https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
# Assigning through .loc with the row label and column name avoids it.
for i in range(len(cdt)):
    if cdt.loc[i, 'Price'] <= 8450:
        cdt.loc[i, 'Price_Class'] = "Low"
    elif cdt.loc[i, 'Price'] > 11950:
        cdt.loc[i, 'Price_Class'] = "High"
    else:
        cdt.loc[i, 'Price_Class'] = "Medium"


### Get the Frequency of Each Value in a Column (Occurrences)

In [55]:
cdt['Price_Class'].value_counts()

Medium    751
Low       369
High      316
Name: Price_Class, dtype: int64
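
The row-by-row loop is useful for practising `if`/`elif`/`else`, but the same three bins can also be produced in one vectorized step. The sketch below uses `pd.cut` as an alternative technique, on made-up prices chosen to sit on both sides of the 8450 and 11950 thresholds.

```python
import numpy as np
import pandas as pd

# Hypothetical prices, including the boundary values themselves
prices = pd.Series([7500, 8450, 8451, 11950, 11951, 20000])

# Bins are right-inclusive by default:
# (-inf, 8450] -> Low, (8450, 11950] -> Medium, (11950, inf) -> High
price_class = pd.cut(
    prices,
    bins=[-np.inf, 8450, 11950, np.inf],
    labels=["Low", "Medium", "High"],
)

print(price_class.value_counts())
```

The result is a categorical column, which also keeps the natural Low < Medium < High ordering.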

### Functions

In [58]:
# New Column as Age_Converted
cdt.insert(11,"Age_Converted",0)

In [62]:
# Function that takes Age (months) and returns the converted Age (years)
def c_convert(val):
    val_converted = val/12
    return val_converted

In [63]:
cdt['Age_Converted'] = c_convert(cdt['Age'])

In [64]:
# Round to 1 decimal place
cdt['Age_Converted'] = round(cdt['Age_Converted'],1)
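
Because the function only uses division, it works on a single number and on a whole column alike, and missing ages stay missing. A small sketch with a hypothetical function name (`months_to_years`) and made-up ages:

```python
import numpy as np
import pandas as pd


def months_to_years(months):
    """Convert an age in months to an age in years."""
    return months / 12


# Works on a scalar
single = months_to_years(24)

# Works element-wise on a Series; NaN propagates through the division
age_months = pd.Series([23.0, 24.0, np.nan, 30.0])
age_years = months_to_years(age_months).round(1)

print(single)
print(age_years)
```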

In [66]:
cdt['Age_Converted'] = np.where(cdt['Age'] > 10, True, False)

In [67]:
cdt['Age_Converted'].value_counts()

True     1311
False     125
Name: Age_Converted, dtype: int64
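
Note that a comparison with a missing value evaluates to `False`, so rows with an unknown `Age` are counted under `False` in the output above. If the missing ages should stay missing, one option is to mask the result afterwards. A minimal sketch on made-up ages:

```python
import numpy as np
import pandas as pd

# Hypothetical ages in months, one of them missing
age = pd.Series([23.0, 8.0, np.nan, 30.0])

# np.where treats the NaN row as "condition not met"
flag = np.where(age > 10, True, False)

# Keep the comparison result only where Age is known; other rows become NaN
flag_keep_na = (age > 10).where(age.notnull())

print(flag.tolist())
print(flag_keep_na.tolist())
```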

In [73]:
cdt['Age'].head()

0    23.0
1    23.0
2    24.0
3    26.0
4    30.0
Name: Age, dtype: float64

In [69]:
cdt['Age_Converted'].tail()

1431    False
1432     True
1433    False
1434     True
1435     True
Name: Age_Converted, dtype: bool

In [70]:
cdt['Age_Converted'][2]

True

In [71]:
# Assign through .loc rather than chained indexing to avoid SettingWithCopyWarning.
# Storing a string in the boolean column changes its dtype to object.
cdt.loc[2, 'Age_Converted'] = "X"


In [74]:
len(cdt.columns)

12

In [75]:
len(cdt)

1436

In [76]:
cdt.size

17232
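
The three numbers are related: `size` is the number of rows times the number of columns (1436 × 12 = 17232 above), and `shape` returns both dimensions at once. A quick sketch on a made-up frame:

```python
import pandas as pd

# Hypothetical frame with 3 rows and 2 columns
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

n_rows, n_cols = df.shape  # (rows, columns)

print(len(df), len(df.columns), df.size)
print(df.size == n_rows * n_cols)
```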

In [77]:
cdt.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1436 entries, 0 to 1435
Data columns (total 12 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Price          1436 non-null   int64  
 1   Age            1336 non-null   float64
 2   KM             1421 non-null   float64
 3   FuelType       1336 non-null   object 
 4   HP             1430 non-null   float64
 5   MetColor       1286 non-null   object 
 6   Automatic      1436 non-null   object 
 7   CC             1436 non-null   int64  
 8   Doors          1436 non-null   int64  
 9   Weight         1436 non-null   int64  
 10  Price_Class    1436 non-null   object 
 11  Age_Converted  1436 non-null   object 
dtypes: float64(3), int64(4), object(5)
memory usage: 178.1+ KB
