### Car sales program:
#### Carry out the following tasks in JupyterLab:

##### Read the data file “car_price.csv” into Python. The data file contains 4 variables: Year, Make, Model, and Price (in USD).
##### Outer join the “car_model” and “car_price” data by their “year”, “make” and “model”.
##### Print the merged DataFrame and check its dimensions.
##### Delete duplicate columns from the DataFrame.


In [1]:
import pandas as pd

car_model = pd.read_csv("car_model.csv")
car_price = pd.read_csv("car_price.csv")

In [2]:
car_model.head()

Unnamed: 0,Year,Make,Model,Category
0,2021,Acura,ILX,Sedan
1,2021,Acura,RDX,SUV
2,2021,Acura,TLX,Sedan
3,2021,Alfa Romeo,Giulia,Sedan
4,2021,Alfa Romeo,Stelvio,SUV


In [3]:
car_price.head()

Unnamed: 0,Year,Make,Model,Price
0,2021,Acura,ILX,26100
1,2021,Acura,RDX,38400
2,2021,Acura,TLX,37500
3,2021,Alfa Romeo,Giulia,40350
4,2021,Alfa Romeo,Stelvio,42350


In [4]:
# Called without `how` or `on`, pd.merge does an inner join on every column
# name the two frames share (here Year, Make and Model).
# For an explicit outer join, pass how='outer' and on=['Year', 'Make', 'Model'].
car_model_price = pd.merge(car_model, car_price)
car_model_price.head()

Unnamed: 0,Year,Make,Model,Category,Price
0,2021,Acura,ILX,Sedan,26100
1,2021,Acura,RDX,SUV,38400
2,2021,Acura,TLX,Sedan,37500
3,2021,Alfa Romeo,Giulia,Sedan,40350
4,2021,Alfa Romeo,Stelvio,SUV,42350


In [5]:
car_model_price.shape

(240, 5)
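
The task asks for an outer join on Year, Make and Model. The following is a minimal sketch of how that differs from the default inner join, using small made-up frames (`models` and `prices` are hypothetical toy data, not the contents of the real CSV files). The `indicator=True` argument adds a `_merge` column that records where each row came from.

```python
import pandas as pd

# Hypothetical toy data, chosen so the two frames only partly overlap.
models = pd.DataFrame({
    "Year": [2021, 2021, 2021],
    "Make": ["Acura", "Acura", "Volvo"],
    "Model": ["ILX", "RDX", "XC40"],
    "Category": ["Sedan", "SUV", "SUV"],
})
prices = pd.DataFrame({
    "Year": [2021, 2021, 2021],
    "Make": ["Acura", "Volvo", "Volvo"],
    "Model": ["ILX", "XC40", "XC60"],
    "Price": [26100, 33700, 41700],
})

# Outer join: keep every key from either frame, filling gaps with NaN.
outer = pd.merge(models, prices, how="outer",
                 on=["Year", "Make", "Model"], indicator=True)

# Default merge: inner join on all shared column names.
inner = pd.merge(models, prices)

print(outer.shape)  # (4, 6)
print(inner.shape)  # (2, 5)
print(outer[["Make", "Model", "_merge"]])
```

If the outer and inner results have the same number of rows on the real data, every Year/Make/Model combination appears in both files.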

Pandas has some guards against duplicate column names: assigning to an existing name overwrites that column, and `read_csv` renames repeated headers. It can, however, be tricked into allowing duplicates.

Duplicate column names are a problem, especially if you plan to transfer your data set to another language. They can also cause problems when debugging.
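
To make the points above concrete, here is a small sketch on made-up data (the frame `df` and the CSV text are hypothetical). It shows two of the guards pandas applies, one way around them, and one practical problem that follows: selecting a duplicated name no longer returns a single Series.

```python
import io
import pandas as pd

# Guard 1: read_csv renames a repeated header instead of keeping the duplicate.
csv_text = "Make,Model,Model\nAcura,ILX,ILX\n"
from_csv = pd.read_csv(io.StringIO(csv_text))
print(list(from_csv.columns))  # ['Make', 'Model', 'Model.1']

# Guard 2: assigning to an existing name overwrites the column.
df = pd.DataFrame({"Make": ["Acura"], "Model": ["ILX"]})
df["Model"] = ["RDX"]
print(df.shape)  # (1, 2)

# Workaround: rename does not check for clashes, so it can create a duplicate.
df["Model2"] = df["Model"]
dup = df.rename(columns={"Model2": "Model"})
print(list(dup.columns))  # ['Make', 'Model', 'Model']

# Problem: selecting the duplicated name now returns a DataFrame, not a Series.
print(type(dup["Model"]).__name__)  # DataFrame
```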

We are going to create a "fake" duplicate column for learning purposes. Don't try this at home, kids.

In [6]:
car_model_price['Model2'] = car_model_price['Model']
car_model_price.rename(columns={'Model2':'Model'}, inplace=True)
car_model_price

Unnamed: 0,Year,Make,Model,Category,Price,Model
0,2021,Acura,ILX,Sedan,26100,ILX
1,2021,Acura,RDX,SUV,38400,RDX
2,2021,Acura,TLX,Sedan,37500,TLX
3,2021,Alfa Romeo,Giulia,Sedan,40350,Giulia
4,2021,Alfa Romeo,Stelvio,SUV,42350,Stelvio
...,...,...,...,...,...,...
235,2021,Volvo,V60,Wagon,40950,V60
236,2021,Volkswagen,Tiguan,SUV,25245,Tiguan
237,2021,Volvo,XC40,SUV,33700,XC40
238,2021,Volvo,XC60,SUV,41700,XC60


In [13]:
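# Dropping by name removes every column called 'Model', not just the duplicate,
# so both copies disappear. We work on a copy to leave car_model_price intact.
car_model_price2 = car_model_price.copy()
car_model_price2 = car_model_price2.drop(['Model'], axis = 1)
car_model_price2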

Unnamed: 0,Year,Make,Category,Price
0,2021,Acura,Sedan,26100
1,2021,Acura,SUV,38400
2,2021,Acura,Sedan,37500
3,2021,Alfa Romeo,Sedan,40350
4,2021,Alfa Romeo,SUV,42350
...,...,...,...,...
235,2021,Volvo,Wagon,40950
236,2021,Volkswagen,SUV,25245
237,2021,Volvo,SUV,33700
238,2021,Volvo,SUV,41700


##### But how do we drop only the duplicated columns?

In [8]:
# Which column names duplicate an earlier column?
car_model_price.columns.duplicated()

array([False, False, False, False, False,  True])

The above line returns a boolean array with one True or False per column. False means the column name is unique up to that point; True means the same name already appeared in an earlier column.

Pandas supports boolean indexing: given a boolean array, it selects only the positions marked True.
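
As a quick sketch of boolean indexing on columns, the snippet below uses a made-up one-row frame (`toy` is hypothetical). Passing the raw `duplicated()` mask to `.loc` selects exactly the columns marked True, which here means only the duplicate.

```python
import pandas as pd

# Hypothetical frame with the name "a" used twice.
toy = pd.DataFrame([[1, 2, 3]], columns=["a", "b", "a"])

# True marks a name that already appeared earlier in the column list.
mask = toy.columns.duplicated()
print(mask.tolist())  # [False, False, True]

# Boolean indexing keeps only the positions where the mask is True.
only_dups = toy.loc[:, mask]
print(only_dups.shape)            # (1, 1)
print(only_dups.iloc[0].tolist()) # [3]
```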

Since we want to keep the unduplicated columns, we need to flip the boolean array, i.e. turn each False into True and each True into False. The `~` operator does exactly that.

In [9]:
~car_model_price.columns.duplicated()

array([ True,  True,  True,  True,  True, False])

In [10]:
car_model_price = car_model_price.loc[:, ~car_model_price.columns.duplicated()]
car_model_price

Unnamed: 0,Year,Make,Model,Category,Price
0,2021,Acura,ILX,Sedan,26100
1,2021,Acura,RDX,SUV,38400
2,2021,Acura,TLX,Sedan,37500
3,2021,Alfa Romeo,Giulia,Sedan,40350
4,2021,Alfa Romeo,Stelvio,SUV,42350
...,...,...,...,...,...
235,2021,Volvo,V60,Wagon,40950
236,2021,Volkswagen,Tiguan,SUV,25245
237,2021,Volvo,XC40,SUV,33700
238,2021,Volvo,XC60,SUV,41700


In [11]:
car_model_price.shape

(240, 5)
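
The same idea can be wrapped in a small reusable function. This is only a sketch: `drop_duplicate_columns` is a hypothetical helper name, not a pandas function, and `toy` is made-up data. It relies on the `keep` argument of `Index.duplicated`, which chooses which occurrence survives. Note that the check compares column names only, so two columns with the same name but different contents are still treated as duplicates.

```python
import pandas as pd

def drop_duplicate_columns(df, keep="first"):
    """Hypothetical helper: return a copy of df without repeated column names.

    keep="first" keeps the first occurrence of each name, keep="last" keeps
    the last, and keep=False drops every column whose name is repeated.
    """
    return df.loc[:, ~df.columns.duplicated(keep=keep)].copy()

# Made-up frame: two columns named "a" with different contents.
toy = pd.DataFrame([[1, 2, 3]], columns=["a", "b", "a"])

first = drop_duplicate_columns(toy)
last = drop_duplicate_columns(toy, keep="last")
none = drop_duplicate_columns(toy, keep=False)

print(list(first.columns), first["a"].iloc[0])  # ['a', 'b'] 1
print(list(last.columns), last["a"].iloc[0])    # ['b', 'a'] 3
print(list(none.columns))                       # ['b']
```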

In [12]:
car_model_price.to_csv("car_model_price.csv", index = False)
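
A short sketch of why `index = False` matters when saving, using a made-up frame (`toy` is hypothetical) and in-memory text instead of a file. If the index is written out, reading the file back produces an extra `Unnamed: 0` column.

```python
import io
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({"Make": ["Acura", "Volvo"], "Price": [26100, 33700]})

# to_csv returns the CSV text when no path is given.
back_with = pd.read_csv(io.StringIO(toy.to_csv()))                # index written
back_without = pd.read_csv(io.StringIO(toy.to_csv(index=False)))  # index omitted

print(list(back_with.columns))     # ['Unnamed: 0', 'Make', 'Price']
print(list(back_without.columns))  # ['Make', 'Price']
```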