# Renaming Columns
In this notebook, we will learn about the `.rename()` method for pandas DataFrames.
Additional information about the method can be found in the pandas documentation: [rename](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html).
We'll be using the `data_08_v1.csv` and `data_18_v1.csv` datasets, which are provided in the workspace and were created on the previous page, where we dropped columns.

In [36]:
# load datasets
import pandas as pd 
df_08 = pd.read_csv('./data_08_v1.csv')
df_18 = pd.read_csv('./data_18_v1.csv')

In [37]:
# view 2008 dataset
df_08.head(1)

Unnamed: 0,Model,Displ,Cyl,Trans,Drive,Fuel,Sales Area,Veh Class,Air Pollution Score,City MPG,Hwy MPG,Cmb MPG,Greenhouse Gas Score,SmartWay
0,ACURA MDX,3.7,(6 cyl),Auto-S5,4WD,Gasoline,CA,SUV,7,15,20,17,4,no


In [38]:
# view 2018 dataset
df_18.head(1)

Unnamed: 0,Model,Displ,Cyl,Trans,Drive,Fuel,Cert Region,Veh Class,Air Pollution Score,City MPG,Hwy MPG,Cmb MPG,Greenhouse Gas Score,SmartWay
0,ACURA RDX,3.5,6.0,SemiAuto-6,2WD,Gasoline,FA,small SUV,3,20,28,23,5,No


### Rename Columns

#### Rename with a dictionary
Renaming can be done with a dictionary that maps each current label to its new label: `{"current_title": "renamed_title"}`.

In [39]:
# use the .rename() method on the df_08 DataFrame
# rename 'Sales Area' to 'Cert Region' by passing a mapping to the `columns` parameter
# use `inplace=True` to modify df_08 itself instead of returning a copy
df_08.rename(columns={'Sales Area': 'Cert Region'}, inplace=True)
# confirm changes
df_08.head(1)

Unnamed: 0,Model,Displ,Cyl,Trans,Drive,Fuel,Cert Region,Veh Class,Air Pollution Score,City MPG,Hwy MPG,Cmb MPG,Greenhouse Gas Score,SmartWay
0,ACURA MDX,3.7,(6 cyl),Auto-S5,4WD,Gasoline,CA,SUV,7,15,20,17,4,no
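
As a side note, `.rename()` returns a new DataFrame when `inplace=True` is omitted, so the result can be assigned to a name instead. The sketch below uses a small made-up frame (`toy` and `renamed` are hypothetical names, not the workspace datasets) to show that the original keeps its labels, and that dictionary keys that match no column are ignored by default.

```python
import pandas as pd

# hypothetical two-column frame standing in for the real dataset
toy = pd.DataFrame({"Sales Area": ["CA", "FA"],
                    "Model": ["ACURA MDX", "ACURA RDX"]})

# without inplace=True, rename returns a renamed copy;
# the key 'Not A Column' matches nothing and is skipped
renamed = toy.rename(columns={"Sales Area": "Cert Region",
                              "Not A Column": "ignored"})

print(list(toy.columns))      # original labels are unchanged
print(list(renamed.columns))  # 'Sales Area' has become 'Cert Region'
```

Assigning the result is often easier to follow in longer notebooks, because each step leaves the previous DataFrame intact.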


#### Rename with a function
Renaming can also be done with a function, which is called once for every column label and returns the new label. The function can be defined traditionally with `def my_function():` or inline as a lambda function, e.g. `lambda x: x`. You can learn more about lambda functions in the [Python 3 Documentation](https://docs.python.org/3/tutorial/controlflow.html#lambda-expressions).

In [40]:
# replace spaces with underscores and lowercase labels for 2008 dataset
df_08.rename(columns=lambda x: x.strip().lower().replace(" ", "_"), inplace=True)

# confirm changes
df_08.head(1)

Unnamed: 0,model,displ,cyl,trans,drive,fuel,cert_region,veh_class,air_pollution_score,city_mpg,hwy_mpg,cmb_mpg,greenhouse_gas_score,smartway
0,ACURA MDX,3.7,(6 cyl),Auto-S5,4WD,Gasoline,CA,SUV,7,15,20,17,4,no
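
For comparison, here is a sketch of the same cleanup written with a named function instead of a lambda. The helper name `clean_label` and the frame `toy` are made up for illustration. A named function can be reused for both datasets and tested on its own.

```python
import pandas as pd

def clean_label(label):
    """Trim whitespace, lowercase, and replace spaces with underscores."""
    return label.strip().lower().replace(" ", "_")

# hypothetical empty frame whose labels need cleaning
toy = pd.DataFrame(columns=[" City MPG", "Greenhouse Gas Score ", "SmartWay"])

# rename calls clean_label once per column label
cleaned = toy.rename(columns=clean_label)
print(list(cleaned.columns))
```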


In [41]:
# replace hyphens with underscores in the 2008 labels
# (no label contains a hyphen, so the output below is unchanged)
df_08.rename(columns=lambda x: x.strip().lower().replace('-', '_'), inplace=True)

# confirm changes
df_08.head(1)

Unnamed: 0,model,displ,cyl,trans,drive,fuel,cert_region,veh_class,air_pollution_score,city_mpg,hwy_mpg,cmb_mpg,greenhouse_gas_score,smartway
0,ACURA MDX,3.7,(6 cyl),Auto-S5,4WD,Gasoline,CA,SUV,7,15,20,17,4,no


In [42]:
# replace spaces with underscores and lowercase labels for 2018 dataset
df_18.rename(columns=lambda x: x.strip().lower().replace(' ','_'),inplace=True)

# confirm changes
df_18.head(1)

Unnamed: 0,model,displ,cyl,trans,drive,fuel,cert_region,veh_class,air_pollution_score,city_mpg,hwy_mpg,cmb_mpg,greenhouse_gas_score,smartway
0,ACURA RDX,3.5,6.0,SemiAuto-6,2WD,Gasoline,FA,small SUV,3,20,28,23,5,No


In [43]:
# confirm column labels for 2008 and 2018 datasets are identical
df_08.columns == df_18.columns

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True])

In [44]:
# collapse the elementwise comparison into a single True/False
(df_08.columns == df_18.columns).all()

True
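
The elementwise `==` above needs both sets of labels to have the same length. As a hedged alternative, `Index.equals()` gives a single boolean even when the lengths differ, and `Index.difference()` lists the labels that do not match. The three index objects below are made up for illustration.

```python
import pandas as pd

# hypothetical label sets; cols_c has one extra label
cols_a = pd.Index(["model", "displ", "cyl"])
cols_b = pd.Index(["model", "displ", "cyl"])
cols_c = pd.Index(["model", "displ", "cyl", "trans"])

same = cols_a.equals(cols_b)        # identical labels in identical order
different = cols_a.equals(cols_c)   # lengths differ, still just a boolean
extra = cols_c.difference(cols_a)   # labels found only in cols_c

print(same, different, list(extra))
```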

In [45]:
# save new datasets for next section
df_08.to_csv('data_08_v2.csv', index=False)
df_18.to_csv('data_18_v2.csv', index=False)
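
To see what `index=False` changes, the sketch below writes a small made-up frame to in-memory buffers (using `io.StringIO` instead of files on disk) with and without the index, then reads both back. When the index is written, `read_csv` picks it up as an extra column labelled `Unnamed: 0`.

```python
import io
import pandas as pd

# hypothetical one-row frame
toy = pd.DataFrame({"model": ["ACURA MDX"], "displ": [3.7]})

# default: the row index is written as an unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)

# index=False: only the data columns are written
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)

cols_with = list(pd.read_csv(with_index).columns)
cols_without = list(pd.read_csv(without_index).columns)
print(cols_with)
print(cols_without)
```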

In [46]:
df1 = pd.read_csv("data_08_v2.csv")

In [47]:
df1.head()

Unnamed: 0,model,displ,cyl,trans,drive,fuel,cert_region,veh_class,air_pollution_score,city_mpg,hwy_mpg,cmb_mpg,greenhouse_gas_score,smartway
0,ACURA MDX,3.7,(6 cyl),Auto-S5,4WD,Gasoline,CA,SUV,7,15,20,17,4,no
1,ACURA MDX,3.7,(6 cyl),Auto-S5,4WD,Gasoline,FA,SUV,6,15,20,17,4,no
2,ACURA RDX,2.3,(4 cyl),Auto-S5,4WD,Gasoline,CA,SUV,7,17,22,19,5,no
3,ACURA RDX,2.3,(4 cyl),Auto-S5,4WD,Gasoline,FA,SUV,6,17,22,19,5,no
4,ACURA RL,3.5,(6 cyl),Auto-S5,4WD,Gasoline,CA,midsize car,7,16,24,19,5,no


In [48]:
# selecting and filtering data two ways: boolean indexing and the query() method
# first way: boolean indexing
df_2wd = df1[df1['drive'] == '2WD']
df_2wd.head()


Unnamed: 0,model,displ,cyl,trans,drive,fuel,cert_region,veh_class,air_pollution_score,city_mpg,hwy_mpg,cmb_mpg,greenhouse_gas_score,smartway
6,ACURA TL,3.2,(6 cyl),Auto-S5,2WD,Gasoline,CA,midsize car,7,18,26,21,6,yes
7,ACURA TL,3.5,(6 cyl),Auto-S5,2WD,Gasoline,CA,midsize car,7,17,26,20,6,yes
8,ACURA TL,3.5,(6 cyl),Man-6,2WD,Gasoline,CA,midsize car,7,18,27,21,6,yes
9,ACURA TL,3.2,(6 cyl),Auto-S5,2WD,Gasoline,FA,midsize car,6,18,26,21,6,no
10,ACURA TL,3.5,(6 cyl),Auto-S5,2WD,Gasoline,FA,midsize car,6,17,26,20,6,no


In [49]:
# second way to filter a DataFrame: the query() method (here selecting 4WD rows)
df_4wd = df1.query('drive == "4WD"')
df_4wd.head()

Unnamed: 0,model,displ,cyl,trans,drive,fuel,cert_region,veh_class,air_pollution_score,city_mpg,hwy_mpg,cmb_mpg,greenhouse_gas_score,smartway
0,ACURA MDX,3.7,(6 cyl),Auto-S5,4WD,Gasoline,CA,SUV,7,15,20,17,4,no
1,ACURA MDX,3.7,(6 cyl),Auto-S5,4WD,Gasoline,FA,SUV,6,15,20,17,4,no
2,ACURA RDX,2.3,(4 cyl),Auto-S5,4WD,Gasoline,CA,SUV,7,17,22,19,5,no
3,ACURA RDX,2.3,(4 cyl),Auto-S5,4WD,Gasoline,FA,SUV,6,17,22,19,5,no
4,ACURA RL,3.5,(6 cyl),Auto-S5,4WD,Gasoline,CA,midsize car,7,16,24,19,5,no
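
The two filtering styles are interchangeable for a simple equality test. The sketch below, on a small made-up frame (`toy` is a hypothetical name), applies the same condition both ways and checks that the results match.

```python
import pandas as pd

# hypothetical frame with the same column names as the cleaned dataset
toy = pd.DataFrame({
    "model": ["ACURA MDX", "ACURA TL", "ACURA RL"],
    "drive": ["4WD", "2WD", "4WD"],
})

by_mask = toy[toy["drive"] == "4WD"]    # boolean indexing
by_query = toy.query('drive == "4WD"')  # query string

# both approaches select the same rows
print(by_mask.equals(by_query))
```

Note that `query()` refers to columns by bare name, which is one reason the labels were cleaned to contain no spaces.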


In [50]:
# boolean indexing: select rows certified for the CA region
df_cert_ca = df1[df1['cert_region'] == 'CA']
df_cert_ca.head()

Unnamed: 0,model,displ,cyl,trans,drive,fuel,cert_region,veh_class,air_pollution_score,city_mpg,hwy_mpg,cmb_mpg,greenhouse_gas_score,smartway
0,ACURA MDX,3.7,(6 cyl),Auto-S5,4WD,Gasoline,CA,SUV,7,15,20,17,4,no
2,ACURA RDX,2.3,(4 cyl),Auto-S5,4WD,Gasoline,CA,SUV,7,17,22,19,5,no
4,ACURA RL,3.5,(6 cyl),Auto-S5,4WD,Gasoline,CA,midsize car,7,16,24,19,5,no
6,ACURA TL,3.2,(6 cyl),Auto-S5,2WD,Gasoline,CA,midsize car,7,18,26,21,6,yes
7,ACURA TL,3.5,(6 cyl),Auto-S5,2WD,Gasoline,CA,midsize car,7,17,26,20,6,yes
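
Both styles also accept more than one condition. As a sketch on made-up data, boolean indexing combines masks with `&` (each condition in its own parentheses), while `query()` uses `and` inside the string.

```python
import pandas as pd

# hypothetical frame; only the second row is both 2WD and CA
toy = pd.DataFrame({
    "model": ["ACURA MDX", "ACURA TL", "ACURA TSX"],
    "drive": ["4WD", "2WD", "2WD"],
    "cert_region": ["CA", "CA", "FA"],
})

# parentheses are required because & binds tighter than ==
by_mask = toy[(toy["drive"] == "2WD") & (toy["cert_region"] == "CA")]
by_query = toy.query('drive == "2WD" and cert_region == "CA"')

print(by_mask)
print(by_mask.equals(by_query))
```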


In [51]:
# second way: the query() method (here selecting rows certified for the FA region)
df_cert_fa = df1.query('cert_region == "FA"')
df_cert_fa.head()

Unnamed: 0,model,displ,cyl,trans,drive,fuel,cert_region,veh_class,air_pollution_score,city_mpg,hwy_mpg,cmb_mpg,greenhouse_gas_score,smartway
1,ACURA MDX,3.7,(6 cyl),Auto-S5,4WD,Gasoline,FA,SUV,6,15,20,17,4,no
3,ACURA RDX,2.3,(4 cyl),Auto-S5,4WD,Gasoline,FA,SUV,6,17,22,19,5,no
5,ACURA RL,3.5,(6 cyl),Auto-S5,4WD,Gasoline,FA,midsize car,6,16,24,19,5,no
9,ACURA TL,3.2,(6 cyl),Auto-S5,2WD,Gasoline,FA,midsize car,6,18,26,21,6,no
10,ACURA TL,3.5,(6 cyl),Auto-S5,2WD,Gasoline,FA,midsize car,6,17,26,20,6,no


In [54]:
df_cert_fa.describe()

Unnamed: 0,displ
count,1157.0
mean,3.760156
std,1.326928
min,1.3
25%,2.5
50%,3.5
75%,4.8
max,8.4


In [59]:
#check for missing values in df_08
df_08.isnull().sum()

model                     0
displ                     0
cyl                     199
trans                   199
drive                    93
fuel                      0
cert_region               0
veh_class                 0
air_pollution_score       0
city_mpg                199
hwy_mpg                 199
cmb_mpg                 199
greenhouse_gas_score    199
smartway                  0
dtype: int64

In [60]:
df_18.isnull().sum()

model                   0
displ                   2
cyl                     2
trans                   0
drive                   0
fuel                    0
cert_region             0
veh_class               0
air_pollution_score     0
city_mpg                0
hwy_mpg                 0
cmb_mpg                 0
greenhouse_gas_score    0
smartway                0
dtype: int64
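
`isnull().sum()` counts missing values per column. A related question is how many rows contain at least one missing value, since that is what a row-wise drop removes. A minimal sketch on made-up data:

```python
import numpy as np
import pandas as pd

# hypothetical frame with gaps in two columns
toy = pd.DataFrame({
    "cyl": [6.0, np.nan, 4.0],
    "drive": ["4WD", None, None],
    "fuel": ["Gasoline", "Gasoline", "Gasoline"],
})

per_column = toy.isnull().sum()                     # missing values in each column
rows_with_missing = toy.isnull().any(axis=1).sum()  # rows with at least one gap

print(per_column)
print(rows_with_missing)
```

The per-column counts can add up to more than the number of affected rows, because one row may be missing several values.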

In [61]:
# drop every row that contains at least one missing value (NaN) from both datasets
df_08.dropna(how='any',axis=0,inplace=True)
df_18.dropna(how='any',axis=0,inplace=True)
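
`how='any'` is the strictest option. As a hedged illustration on made-up data, the sketch below compares it with `how='all'` and with the `subset` parameter, which limits the check to chosen columns.

```python
import numpy as np
import pandas as pd

# hypothetical frame: row B misses one value, row C misses two
toy = pd.DataFrame({
    "model": ["A", "B", "C"],
    "cyl": [6.0, np.nan, np.nan],
    "drive": ["4WD", "2WD", None],
})

any_missing = toy.dropna(how="any")         # drops rows with any gap
all_missing = toy.dropna(how="all")         # drops only fully empty rows
subset_only = toy.dropna(subset=["drive"])  # checks the 'drive' column only

print(len(any_missing), len(all_missing), len(subset_only))
```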

In [63]:
print(df_08.isnull().sum())
print(df_18.isnull().sum())

model                   0
displ                   0
cyl                     0
trans                   0
drive                   0
fuel                    0
cert_region             0
veh_class               0
air_pollution_score     0
city_mpg                0
hwy_mpg                 0
cmb_mpg                 0
greenhouse_gas_score    0
smartway                0
dtype: int64
model                   0
displ                   0
cyl                     0
trans                   0
drive                   0
fuel                    0
cert_region             0
veh_class               0
air_pollution_score     0
city_mpg                0
hwy_mpg                 0
cmb_mpg                 0
greenhouse_gas_score    0
smartway                0
dtype: int64


In [64]:
# count the duplicated rows in each dataset
print(df_08.duplicated().sum())
print(df_18.duplicated().sum())

63
5
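
By default `duplicated()` flags only the repeats, not the first occurrence, which is why its sum equals the number of rows that `drop_duplicates()` removes. The sketch below, on made-up data, also uses `keep=False` to view every copy of a duplicated row before dropping anything.

```python
import pandas as pd

# hypothetical frame where the first two rows are identical
toy = pd.DataFrame({
    "model": ["ACURA TL", "ACURA TL", "ACURA RL"],
    "displ": [3.2, 3.2, 3.5],
})

n_repeats = toy.duplicated().sum()            # repeats only; first copy not counted
all_copies = toy[toy.duplicated(keep=False)]  # every row that has a twin
deduped = toy.drop_duplicates()               # keeps the first copy of each row

print(n_repeats)
print(all_copies)
print(deduped)
```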


In [67]:
# drop duplicated rows from both datasets
df_08.drop_duplicates(inplace=True)
df_18.drop_duplicates(inplace=True)

In [68]:
# confirm that no duplicated rows remain in either dataset
print(df_08.duplicated().sum())
print(df_18.duplicated().sum())

0
0


In this notebook, we renamed the inconsistent column labels, queried the data in a few ways, and dropped rows with missing values and duplicated rows.