
Pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis/manipulation tool available in any language. It is already well on its way toward this goal.<sup>[[1]](https://pandas.pydata.org/docs/getting_started/overview.html)</sup>

The purpose of this study is to get familiar with the pandas library and its capabilities, using a manually created DataFrame and the [US Cars Dataset](https://www.kaggle.com/datasets/doaaalsenani/usa-cers-dataset) from Kaggle.

This study is based on the Data Analysis with Python - Full Course for Beginners [YouTube video](https://www.youtube.com/watch?v=r-uOLxNrNk8) from [freeCodeCamp.org](https://www.freecodecamp.org/) and the [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/index.html) by Jake VanderPlas.

We'll cover:
- [Create DataFrame](#create)
- [Explore DataFrame](#explore)
- [Statistics](#statistics)
- [Operations](#operations)
- [Conclusion](#conclusion)

[1] https://pandas.pydata.org/

# Importing Libraries

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/usa-cers-dataset/USA_cars_datasets.csv


# Creating a new Data Frame <a id="create"></a>

## Manually

In [2]:
df = pd.DataFrame({
    'Country': ['Canada',
            'France',
            'Germany',
            'Italy',
            'Japan',
            'United Kingdom',
            'United States'],
    'Population': [35.467, 63.951, 83.94, 60.665, 127.061, 64.511, 318.523],
    'GDP': [1785387, 2833687, 3874437, 2167744, 4602367, 2950039, 17348075],
    'Surface Area': [9984670, 640679, 357114, 301336, 377930, 242495, 9525067],
    'HDI': [0.913, 0.888, 0.916, 0.873, 0.891, 0.907, 0.915],
    'Continent': [
                  'America',
                  'Europe',
                  'Europe',
                  'Europe',
                  'Asia',
                  'Europe',
                  'America'
    ]
}, columns=['Country', 'Population', 'GDP', 'Surface Area', 'HDI', 'Continent'])

In [3]:
df

Unnamed: 0,Country,Population,GDP,Surface Area,HDI,Continent
0,Canada,35.467,1785387,9984670,0.913,America
1,France,63.951,2833687,640679,0.888,Europe
2,Germany,83.94,3874437,357114,0.916,Europe
3,Italy,60.665,2167744,301336,0.873,Europe
4,Japan,127.061,4602367,377930,0.891,Asia
5,United Kingdom,64.511,2950039,242495,0.907,Europe
6,United States,318.523,17348075,9525067,0.915,America


## Reading from External Source

In this section we'll cover only reading data from a .csv file. The same approach extends to Excel, JSON, SQL databases, and many other sources through built-in readers such as `pd.read_excel`, `pd.read_json`, and `pd.read_sql`.

In [4]:
df_cars = pd.read_csv('../input/usa-cers-dataset/USA_cars_datasets.csv')

In [5]:
df_cars

Unnamed: 0.1,Unnamed: 0,price,brand,model,year,title_status,mileage,color,vin,lot,state,country,condition
0,0,6300,toyota,cruiser,2008,clean vehicle,274117.0,black,jtezu11f88k007763,159348797,new jersey,usa,10 days left
1,1,2899,ford,se,2011,clean vehicle,190552.0,silver,2fmdk3gc4bbb02217,166951262,tennessee,usa,6 days left
2,2,5350,dodge,mpv,2018,clean vehicle,39590.0,silver,3c4pdcgg5jt346413,167655728,georgia,usa,2 days left
3,3,25000,ford,door,2014,clean vehicle,64146.0,blue,1ftfw1et4efc23745,167753855,virginia,usa,22 hours left
4,4,27700,chevrolet,1500,2018,clean vehicle,6654.0,red,3gcpcrec2jg473991,167763266,florida,usa,22 hours left
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2494,2494,7800,nissan,versa,2019,clean vehicle,23609.0,red,3n1cn7ap9kl880319,167722715,california,usa,1 days left
2495,2495,9200,nissan,versa,2018,clean vehicle,34553.0,silver,3n1cn7ap5jl884088,167762225,florida,usa,21 hours left
2496,2496,9200,nissan,versa,2018,clean vehicle,31594.0,silver,3n1cn7ap9jl884191,167762226,florida,usa,21 hours left
2497,2497,9200,nissan,versa,2018,clean vehicle,32557.0,black,3n1cn7ap3jl883263,167762227,florida,usa,2 days left
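The readers share a common pattern: each accepts a path or a file-like object and returns a DataFrame. The sketch below illustrates this with small in-memory data instead of files on disk; the values in `csv_text` are made up for illustration and are not taken from the dataset file.

```python
import io

import pandas as pd

# Hypothetical inline data standing in for a file on disk.
csv_text = """price,brand,year
6300,toyota,2008
2899,ford,2011
"""

# read_csv accepts a path or any file-like object.
cars_small = pd.read_csv(io.StringIO(csv_text))

# Round-trip the same records through JSON text to show read_json.
json_text = cars_small.to_json(orient="records")
cars_from_json = pd.read_json(io.StringIO(json_text))

print(cars_small.shape)
print(cars_from_json["brand"].tolist())
```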


# Exploring the DataFrame <a id="explore"></a>

## Check the top 5 rows

In [6]:
df_cars.head()

Unnamed: 0.1,Unnamed: 0,price,brand,model,year,title_status,mileage,color,vin,lot,state,country,condition
0,0,6300,toyota,cruiser,2008,clean vehicle,274117.0,black,jtezu11f88k007763,159348797,new jersey,usa,10 days left
1,1,2899,ford,se,2011,clean vehicle,190552.0,silver,2fmdk3gc4bbb02217,166951262,tennessee,usa,6 days left
2,2,5350,dodge,mpv,2018,clean vehicle,39590.0,silver,3c4pdcgg5jt346413,167655728,georgia,usa,2 days left
3,3,25000,ford,door,2014,clean vehicle,64146.0,blue,1ftfw1et4efc23745,167753855,virginia,usa,22 hours left
4,4,27700,chevrolet,1500,2018,clean vehicle,6654.0,red,3gcpcrec2jg473991,167763266,florida,usa,22 hours left


## Check the top n rows

In [7]:
df_cars.head(10)

Unnamed: 0.1,Unnamed: 0,price,brand,model,year,title_status,mileage,color,vin,lot,state,country,condition
0,0,6300,toyota,cruiser,2008,clean vehicle,274117.0,black,jtezu11f88k007763,159348797,new jersey,usa,10 days left
1,1,2899,ford,se,2011,clean vehicle,190552.0,silver,2fmdk3gc4bbb02217,166951262,tennessee,usa,6 days left
2,2,5350,dodge,mpv,2018,clean vehicle,39590.0,silver,3c4pdcgg5jt346413,167655728,georgia,usa,2 days left
3,3,25000,ford,door,2014,clean vehicle,64146.0,blue,1ftfw1et4efc23745,167753855,virginia,usa,22 hours left
4,4,27700,chevrolet,1500,2018,clean vehicle,6654.0,red,3gcpcrec2jg473991,167763266,florida,usa,22 hours left
5,5,5700,dodge,mpv,2018,clean vehicle,45561.0,white,2c4rdgeg9jr237989,167655771,texas,usa,2 days left
6,6,7300,chevrolet,pk,2010,clean vehicle,149050.0,black,1gcsksea1az121133,167753872,georgia,usa,22 hours left
7,7,13350,gmc,door,2017,clean vehicle,23525.0,gray,1gks2gkc3hr326762,167692494,california,usa,20 hours left
8,8,14600,chevrolet,malibu,2018,clean vehicle,9371.0,silver,1g1zd5st5jf191860,167763267,florida,usa,22 hours left
9,9,5250,ford,mpv,2017,clean vehicle,63418.0,black,2fmpk3j92hbc12542,167656121,texas,usa,2 days left


## Check the bottom 5 rows

In [8]:
df_cars.tail()

Unnamed: 0.1,Unnamed: 0,price,brand,model,year,title_status,mileage,color,vin,lot,state,country,condition
2494,2494,7800,nissan,versa,2019,clean vehicle,23609.0,red,3n1cn7ap9kl880319,167722715,california,usa,1 days left
2495,2495,9200,nissan,versa,2018,clean vehicle,34553.0,silver,3n1cn7ap5jl884088,167762225,florida,usa,21 hours left
2496,2496,9200,nissan,versa,2018,clean vehicle,31594.0,silver,3n1cn7ap9jl884191,167762226,florida,usa,21 hours left
2497,2497,9200,nissan,versa,2018,clean vehicle,32557.0,black,3n1cn7ap3jl883263,167762227,florida,usa,2 days left
2498,2498,9200,nissan,versa,2018,clean vehicle,31371.0,silver,3n1cn7ap4jl884311,167762228,florida,usa,21 hours left


## Get n number of random samples

In [9]:
df_cars.sample(10)

Unnamed: 0.1,Unnamed: 0,price,brand,model,year,title_status,mileage,color,vin,lot,state,country,condition
2323,2323,7500,nissan,door,2016,clean vehicle,19723.0,black,3n1ab7ap9gy325347,167750212,georgia,usa,2 hours left
2203,2203,13800,ford,fusion,2019,clean vehicle,30902.0,no_color,3fa6p0lu8kr225585,167802381,north carolina,usa,2 days left
440,440,13000,kia,sportage,2017,clean vehicle,71991.0,silver,kndpn3ac7h7113313,167606881,new jersey,usa,2 hours left
911,911,20800,dodge,charger,2019,clean vehicle,35098.0,gray,2c3cdxhgxkh527140,167553404,california,usa,8 days left
2130,2130,26000,ford,explorer,2019,clean vehicle,32458.0,white,1fm5k7f83kga17399,167802226,north carolina,usa,2 days left
558,558,27502,ford,f-150,2019,clean vehicle,16146.0,gray,1ftew1e45kfb69856,167763407,michigan,usa,1 minutes
1592,1592,39950,ford,f-150,2019,clean vehicle,9278.0,white,1ftew1e42kfc12713,167565282,georgia,usa,4 days left
2036,2036,9705,ford,door,2017,clean vehicle,72377.0,black,3fa6p0h76hr319394,167744596,arkansas,usa,21 hours left
1797,1797,22800,ford,f-150,2019,clean vehicle,13002.0,white,1ftmf1c58kkc79282,167799934,texas,usa,2 days left
928,928,15400,dodge,caravan,2019,clean vehicle,41601.0,blue,2c4rdgeg5kr563175,167553585,california,usa,8 days left


## Get the shape of the DataFrame 

The shape is a tuple of the form (number of rows, number of columns). The example below shows 2499 rows and 13 columns.

In [10]:
df_cars.shape

(2499, 13)

## Get the total number of elements in the DataFrame

It returns the number of data points.

Size = number of rows * number of columns

In [11]:
df_cars.size

32487
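The relationship between `shape`, `size`, and `len()` can be checked on a small frame. This is a minimal sketch; the `toy` DataFrame is a made-up example, not part of the cars dataset.

```python
import pandas as pd

# Small illustrative frame with 3 rows and 2 columns.
toy = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

n_rows, n_cols = toy.shape       # shape is a (rows, columns) tuple
total_cells = toy.size           # size counts every cell

print(n_rows, n_cols, total_cells)
print(len(toy))                  # len() counts rows only
```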

## Get the columns in the DataFrame

It returns a pandas `Index` object rather than a plain list; call `.tolist()` on it if a list is needed.

In [12]:
df_cars.columns

Index(['Unnamed: 0', 'price', 'brand', 'model', 'year', 'title_status',
       'mileage', 'color', 'vin', 'lot', 'state', 'country', 'condition'],
      dtype='object')

## Get the index of the DataFrame

In [13]:
df_cars.index

RangeIndex(start=0, stop=2499, step=1)

## Get the data types of each column

Pandas automatically assigns a data type to each column, backed by NumPy dtypes.

In [14]:
df_cars.dtypes

Unnamed: 0        int64
price             int64
brand            object
model            object
year              int64
title_status     object
mileage         float64
color            object
vin              object
lot               int64
state            object
country          object
condition        object
dtype: object

In [15]:
# Get number of each different data type in the DataFrame
df_cars.dtypes.value_counts()

object     8
int64      4
float64    1
dtype: int64
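Once the dtypes are known, they can be used to pick columns or be changed explicitly. The sketch below uses a small made-up frame (`toy`) to show `select_dtypes` and `astype`; whether such conversions are appropriate for the cars dataset is left as an assumption.

```python
import pandas as pd

# Illustrative frame mixing integer, string, and float columns.
toy = pd.DataFrame({
    "price": [6300, 2899],
    "brand": ["toyota", "ford"],
    "mileage": [274117.0, 190552.0],
})

# Names of all numeric columns (int and float).
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

# Explicit conversions: float -> int64, text -> category.
toy["mileage"] = toy["mileage"].astype("int64")
toy["brand"] = toy["brand"].astype("category")

print(numeric_cols)
print(toy.dtypes)
```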

## Get information about the DataFrame

It provides a quick structural summary of the DataFrame: the column names, the data type of each column, and the number of non-null values in each column.

In [16]:
df_cars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2499 entries, 0 to 2498
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Unnamed: 0    2499 non-null   int64  
 1   price         2499 non-null   int64  
 2   brand         2499 non-null   object 
 3   model         2499 non-null   object 
 4   year          2499 non-null   int64  
 5   title_status  2499 non-null   object 
 6   mileage       2499 non-null   float64
 7   color         2499 non-null   object 
 8   vin           2499 non-null   object 
 9   lot           2499 non-null   int64  
 10  state         2499 non-null   object 
 11  country       2499 non-null   object 
 12  condition     2499 non-null   object 
dtypes: float64(1), int64(4), object(8)
memory usage: 253.9+ KB


In [17]:
# This can also be applied to a filter
df_cars[(df_cars['mileage'] > 100000)].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 327 entries, 0 to 2443
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Unnamed: 0    327 non-null    int64  
 1   price         327 non-null    int64  
 2   brand         327 non-null    object 
 3   model         327 non-null    object 
 4   year          327 non-null    int64  
 5   title_status  327 non-null    object 
 6   mileage       327 non-null    float64
 7   color         327 non-null    object 
 8   vin           327 non-null    object 
 9   lot           327 non-null    int64  
 10  state         327 non-null    object 
 11  country       327 non-null    object 
 12  condition     327 non-null    object 
dtypes: float64(1), int64(4), object(8)
memory usage: 35.8+ KB


## NULL Values

Get the total number of NULL data points for each column.

In [18]:
df_cars.isnull().sum()

Unnamed: 0      0
price           0
brand           0
model           0
year            0
title_status    0
mileage         0
color           0
vin             0
lot             0
state           0
country         0
condition       0
dtype: int64
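The cars dataset has no missing values, so the counts above are all zero. To see what the same call reports when values are missing, and two common ways to handle them, here is a minimal sketch on a made-up frame (`toy`); the fill values are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative frame with one missing value per column.
toy = pd.DataFrame({
    "price": [6300, np.nan, 5350],
    "color": ["black", "silver", None],
})

# Count missing values per column.
missing_per_column = toy.isnull().sum()

# Option 1: fill missing values, with a different fill per column.
filled = toy.fillna({"price": toy["price"].median(), "color": "unknown"})

# Option 2: keep only the rows without any missing value.
complete_rows = toy.dropna()

print(missing_per_column)
print(filled)
print(complete_rows)
```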

## Get specific column(s)

In [19]:
df_cars['brand']

0          toyota
1            ford
2           dodge
3            ford
4       chevrolet
          ...    
2494       nissan
2495       nissan
2496       nissan
2497       nissan
2498       nissan
Name: brand, Length: 2499, dtype: object

In [20]:
df_cars[['brand', 'model']]

Unnamed: 0,brand,model
0,toyota,cruiser
1,ford,se
2,dodge,mpv
3,ford,door
4,chevrolet,1500
...,...,...
2494,nissan,versa
2495,nissan,versa
2496,nissan,versa
2497,nissan,versa


In [21]:
columns = ['brand', 'model', 'title_status']
df_cars[columns]

Unnamed: 0,brand,model,title_status
0,toyota,cruiser,clean vehicle
1,ford,se,clean vehicle
2,dodge,mpv,clean vehicle
3,ford,door,clean vehicle
4,chevrolet,1500,clean vehicle
...,...,...,...
2494,nissan,versa,clean vehicle
2495,nissan,versa,clean vehicle
2496,nissan,versa,clean vehicle
2497,nissan,versa,clean vehicle


In [22]:
# Get the selected column as a DataFrame instead of a Series.
df_cars['brand'].to_frame()

Unnamed: 0,brand
0,toyota
1,ford
2,dodge
3,ford
4,chevrolet
...,...
2494,nissan
2495,nissan
2496,nissan
2497,nissan


## Get specific row(s)

There are two approaches to selecting specific rows: by index label with `loc` and by integer position with `iloc`. To show both, we first need to set a meaningful index on the first DataFrame.

In [23]:
# Updating the index from numeric to string to provide more details.
# This is different than the Country column in the manually created DataFrame.
df.index = [
            'Canada',
            'France',
            'Germany',
            'Italy',
            'Japan',
            'United Kingdom',
            'United States'
]

In [24]:
df

Unnamed: 0,Country,Population,GDP,Surface Area,HDI,Continent
Canada,Canada,35.467,1785387,9984670,0.913,America
France,France,63.951,2833687,640679,0.888,Europe
Germany,Germany,83.94,3874437,357114,0.916,Europe
Italy,Italy,60.665,2167744,301336,0.873,Europe
Japan,Japan,127.061,4602367,377930,0.891,Asia
United Kingdom,United Kingdom,64.511,2950039,242495,0.907,Europe
United States,United States,318.523,17348075,9525067,0.915,America


In [25]:
# Select individual row by index. Returns Pandas Series.
df.loc['France']

Country          France
Population       63.951
GDP             2833687
Surface Area     640679
HDI               0.888
Continent        Europe
Name: France, dtype: object

In [26]:
# Select individual row by sequential position. Returns Pandas Series.
df.iloc[1]

Country          France
Population       63.951
GDP             2833687
Surface Area     640679
HDI               0.888
Continent        Europe
Name: France, dtype: object

In [27]:
# Get the second and third rows (the end position of the slice is excluded)
df.iloc[1:3]

Unnamed: 0,Country,Population,GDP,Surface Area,HDI,Continent
France,France,63.951,2833687,640679,0.888,Europe
Germany,Germany,83.94,3874437,357114,0.916,Europe


In [28]:
# or the last row
df.iloc[-1]

Country         United States
Population            318.523
GDP                  17348075
Surface Area          9525067
HDI                     0.915
Continent             America
Name: United States, dtype: object

In [29]:
# Combined selection using indexing
df.loc['France':'Italy', ['Population', 'GDP']]

Unnamed: 0,Population,GDP
France,63.951,2833687
Germany,83.94,3874437
Italy,60.665,2167744


In [30]:
# Combined selection using sequential position
df.iloc[1:3, [0, 3]]
#df.iloc[1:3, 1:3] -> a slice instead selects the columns at positions 1 and 2 (Population and GDP)

Unnamed: 0,Country,Surface Area
France,France,640679
Germany,Germany,357114


## Conditional Selection

In [31]:
# Get all rows of the DataFrame that satisfy a condition on one column
df_cars[(df_cars['mileage'] > 100000)]

Unnamed: 0.1,Unnamed: 0,price,brand,model,year,title_status,mileage,color,vin,lot,state,country,condition
0,0,6300,toyota,cruiser,2008,clean vehicle,274117.0,black,jtezu11f88k007763,159348797,new jersey,usa,10 days left
1,1,2899,ford,se,2011,clean vehicle,190552.0,silver,2fmdk3gc4bbb02217,166951262,tennessee,usa,6 days left
6,6,7300,chevrolet,pk,2010,clean vehicle,149050.0,black,1gcsksea1az121133,167753872,georgia,usa,22 hours left
10,10,10400,dodge,coupe,2009,clean vehicle,107856.0,orange,2b3lj54t49h509675,167753874,georgia,usa,22 hours left
13,13,5430,chrysler,wagon,2017,clean vehicle,138650.0,gray,2c4rc1cg5hr616095,167656123,texas,usa,2 days left
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2410,2410,2425,nissan,door,2012,salvage insurance,161836.0,black,jn8af5mr5ct111885,167616381,tennessee,usa,2 days left
2414,2414,2100,nissan,door,2013,salvage insurance,155704.0,silver,5n1ar2mn3dc619753,167616437,tennessee,usa,2 days left
2415,2415,800,nissan,door,2012,salvage insurance,234792.0,black,1n4al2ap9cc190456,167617515,florida,usa,2 days left
2418,2418,3000,nissan,door,2015,clean vehicle,132655.0,beige,1n4al3ap4fn890836,167755716,pennsylvania,usa,2 days left


In [32]:
# Get specific columns based on multiple conditions
df_cars[(df_cars['mileage'] > 100000) & (df_cars['state'] == 'virginia')][['brand', 'model', 'mileage','price']]

Unnamed: 0,brand,model,mileage,price
14,ford,door,100757.0,20700
24,ford,door,105510.0,20800
356,gmc,mpv,117400.0,6300
383,chevrolet,door,194903.0,25
391,cadillac,coupe,105169.0,0
400,buick,door,137464.0,0
406,honda,van,109027.0,8160
1623,ford,door,163334.0,1350
1869,ford,cab,129873.0,6000
1984,ford,door,167376.0,1200


In [33]:
# Binning: divide the data into 5 equal-width ranges and count the rows in each
pd.cut(df_cars['price'], bins = 5).value_counts()

(-84.9, 16980.0]      1256
(16980.0, 33960.0]     995
(33960.0, 50940.0]     203
(50940.0, 67920.0]      42
(67920.0, 84900.0]       3
Name: price, dtype: int64
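With `bins=5`, pandas picks equal-width edges on its own, which is why the first interval starts slightly below zero. When specific ranges are wanted, `pd.cut` also accepts explicit edges and labels. In the sketch below the edges and the labels are arbitrary, hypothetical choices.

```python
import pandas as pd

# A few illustrative prices.
prices = pd.Series([0, 2899, 16900, 25000, 84900], name="price")

# Hypothetical bin edges and labels chosen for illustration.
edges = [0, 10000, 30000, 90000]
labels = ["low", "mid", "high"]

# include_lowest=True makes the first interval include the left edge (0).
binned = pd.cut(prices, bins=edges, labels=labels, include_lowest=True)

# sort=False keeps the counts in bin order rather than by frequency.
counts = binned.value_counts(sort=False)
print(counts)
```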

## Sorting

Passing `ascending=False` sorts the DataFrame in descending order.

In [34]:
df_cars[(df_cars['mileage'] > 100000) & (df_cars['state'] == 'virginia')][['brand', 'model', 'mileage','price']].sort_values(by='price', ascending=False)

Unnamed: 0,brand,model,mileage,price
24,ford,door,105510.0,20800
14,ford,door,100757.0,20700
406,honda,van,109027.0,8160
356,gmc,mpv,117400.0,6300
1869,ford,cab,129873.0,6000
1623,ford,door,163334.0,1350
1984,ford,door,167376.0,1200
383,chevrolet,door,194903.0,25
391,cadillac,coupe,105169.0,0
400,buick,door,137464.0,0


# Statistics <a id="statistics"></a>

## Get the summary of the statistics: numeric columns

In [35]:
df_cars.describe()

Unnamed: 0.1,Unnamed: 0,price,year,mileage,lot
count,2499.0,2499.0,2499.0,2499.0,2499.0
mean,1249.0,18767.671469,2016.714286,52298.69,167691400.0
std,721.543484,12116.094936,3.442656,59705.52,203877.2
min,0.0,0.0,1973.0,0.0,159348800.0
25%,624.5,10200.0,2016.0,21466.5,167625300.0
50%,1249.0,16900.0,2018.0,35365.0,167745100.0
75%,1873.5,25555.5,2019.0,63472.5,167779800.0
max,2498.0,84900.0,2020.0,1017936.0,167805500.0


## Get the summary of the statistics: non-numeric columns

It returns the number of unique values, the most frequent value, and its frequency.

In [36]:
df_cars.describe(include='object')

Unnamed: 0,brand,model,title_status,color,vin,state,country,condition
count,2499,2499,2499,2499,2499,2499,2499,2499
unique,28,127,2,49,2495,44,2,47
top,ford,door,clean vehicle,white,1gnevhkw8jj148388,pennsylvania,usa,2 days left
freq,1235,651,2336,707,2,299,2492,832


## Get the summary of the statistics: both numeric and non-numeric columns

In [37]:
df_cars.describe(include='all')

Unnamed: 0.1,Unnamed: 0,price,brand,model,year,title_status,mileage,color,vin,lot,state,country,condition
count,2499.0,2499.0,2499,2499,2499.0,2499,2499.0,2499,2499,2499.0,2499,2499,2499
unique,,,28,127,,2,,49,2495,,44,2,47
top,,,ford,door,,clean vehicle,,white,1gnevhkw8jj148388,,pennsylvania,usa,2 days left
freq,,,1235,651,,2336,,707,2,,299,2492,832
mean,1249.0,18767.671469,,,2016.714286,,52298.69,,,167691400.0,,,
std,721.543484,12116.094936,,,3.442656,,59705.52,,,203877.2,,,
min,0.0,0.0,,,1973.0,,0.0,,,159348800.0,,,
25%,624.5,10200.0,,,2016.0,,21466.5,,,167625300.0,,,
50%,1249.0,16900.0,,,2018.0,,35365.0,,,167745100.0,,,
75%,1873.5,25555.5,,,2019.0,,63472.5,,,167779800.0,,,


In [38]:
# The describe function returns its result as a DataFrame,
# which means that you can apply all built-in DataFrame methods to the result as well.
type(df_cars.describe(include='all'))

pandas.core.frame.DataFrame

In [39]:
df_cars.describe().loc[['min', 'mean', 'max']]

Unnamed: 0.1,Unnamed: 0,price,year,mileage,lot
min,0.0,0.0,1973.0,0.0,159348800.0
mean,1249.0,18767.671469,2016.714286,52298.69,167691400.0
max,2498.0,84900.0,2020.0,1017936.0,167805500.0


In [40]:
# Get the summary of statistics for specific columns and transpose it
df_cars[['price', 'mileage']].describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
price,2499.0,18767.671469,12116.094936,0.0,10200.0,16900.0,25555.5,84900.0
mileage,2499.0,52298.685474,59705.516356,0.0,21466.5,35365.0,63472.5,1017936.0


## Get basic statistics of specific column

- Minimum
- Maximum
- Mean
- Median
- Standard Deviation 
- Quantile

In [41]:
df_cars['mileage'].min()

0.0

In [42]:
df_cars['mileage'].max()

1017936.0

In [43]:
# Mean
df_cars['mileage'].mean()

52298.685474189675

In [44]:
# Standard deviation
df_cars['mileage'].std()

59705.51635643581

In [45]:
df_cars['mileage'].median()

35365.0

In [46]:
df_cars['mileage'].quantile(.25)

21466.5

In [47]:
df_cars['mileage'].quantile([.2, .4, .6, .8])

0.2    18275.8
0.4    30629.2
0.6    41202.0
0.8    77459.0
Name: mileage, dtype: float64

In [48]:
# Get n largest values
df_cars.nlargest(3, 'mileage')

Unnamed: 0.1,Unnamed: 0,price,brand,model,year,title_status,mileage,color,vin,lot,state,country,condition
528,528,1025,peterbilt,truck,2010,salvage insurance,1017936.0,color:,1xp7d49x0ad793710,167529842,georgia,usa,17 hours left
1827,1827,3200,ford,door,2013,clean vehicle,999999.0,silver,1fadp3k21dl266148,167727773,south carolina,usa,21 hours left
516,516,0,peterbilt,truck,2009,salvage insurance,982486.0,blue,1xp7d49x09d784257,167529788,florida,usa,17 hours left


In [49]:
# Get n smallest values
df_cars.nsmallest(3, 'mileage')

Unnamed: 0.1,Unnamed: 0,price,brand,model,year,title_status,mileage,color,vin,lot,state,country,condition
309,309,0,chevrolet,door,2004,salvage insurance,0.0,maroon,3gnek12t74g240524,167418651,wyoming,usa,18 hours left
322,322,0,ford,chassis,1994,salvage insurance,0.0,green,1fdee14n7rha47894,167359174,california,usa,19 hours left
504,504,100,peterbilt,truck,2012,salvage insurance,0.0,blue,1xp4d49x1cd144875,167529787,florida,usa,17 hours left


### Get the specific row of min/max of specific column

It returns the row as a Series.

In [50]:
# idxmax() returns an index label, so it is paired with loc
df_cars.loc[df_cars['mileage'].idxmax()]
# df_cars.loc[df_cars['mileage'].idxmin()] -> for min

Unnamed: 0                      528
price                          1025
brand                     peterbilt
model                         truck
year                           2010
title_status      salvage insurance
mileage                   1017936.0
color                        color:
vin               1xp7d49x0ad793710
lot                       167529842
state                       georgia
country                         usa
condition             17 hours left
Name: 528, dtype: object

## Non-numeric Columns

In [51]:
# Get the frequency of each object in specific column
df_cars['brand'].value_counts()

ford               1235
dodge               432
nissan              312
chevrolet           297
gmc                  42
jeep                 30
chrysler             18
bmw                  17
hyundai              15
kia                  13
buick                13
infiniti             12
honda                12
cadillac             10
mercedes-benz        10
heartland             5
land                  4
peterbilt             4
audi                  4
acura                 3
lincoln               2
lexus                 2
mazda                 2
maserati              1
toyota                1
harley-davidson       1
jaguar                1
ram                   1
Name: brand, dtype: int64

In [52]:
# Get the list of the unique objects in specific column
df_cars['brand'].unique()

array(['toyota', 'ford', 'dodge', 'chevrolet', 'gmc', 'chrysler', 'kia',
       'buick', 'infiniti', 'mercedes-benz', 'jeep', 'bmw', 'cadillac',
       'hyundai', 'mazda', 'honda', 'heartland', 'jaguar', 'acura',
       'harley-davidson', 'audi', 'lincoln', 'lexus', 'nissan', 'land',
       'maserati', 'peterbilt', 'ram'], dtype=object)

In [53]:
# Get the number of unique objects in specific column
df_cars['brand'].nunique()
# df_cars['brand'].unique().size -> same as above

28
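Counts are often easier to compare as shares. `value_counts` accepts `normalize=True` for that; here is a minimal sketch on a made-up Series (`brands`):

```python
import pandas as pd

# Illustrative brand column.
brands = pd.Series(["ford", "ford", "dodge", "nissan"], name="brand")

# Relative frequencies instead of raw counts; the shares sum to 1.
shares = brands.value_counts(normalize=True)
print(shares)
```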

## Pivot Tables

### Pivot Tables by Hand

In [54]:
df_cars.groupby('brand')[['price']].mean()

Unnamed: 0_level_0,price
brand,Unnamed: 1_level_1
acura,7266.666667
audi,13981.25
bmw,26397.058824
buick,19715.769231
cadillac,24941.0
chevrolet,18669.952862
chrysler,13686.111111
dodge,17781.988426
ford,21666.888259
gmc,10657.380952


In [55]:
df_cars.groupby('brand')[['price']].sum()

Unnamed: 0_level_0,price
brand,Unnamed: 1_level_1
acura,21800
audi,55925
bmw,448750
buick,256305
cadillac,249410
chevrolet,5544976
chrysler,246350
dodge,7681819
ford,26758607
gmc,447610


In [56]:
df_cars.groupby(['brand', 'model'])[['price']].mean()

Unnamed: 0_level_0,Unnamed: 1_level_0,price
brand,model,Unnamed: 2_level_1
acura,door,2450.0
acura,mdx,16900.0
audi,5,36400.0
audi,door,12.5
audi,q5,19500.0
...,...,...
nissan,xd,36300.0
nissan,xterra,6700.0
peterbilt,truck,400.0
ram,door,11050.0


### Pivot Table Syntax

In [57]:
df_cars.pivot_table('price', index=['brand', 'model'], aggfunc='mean', dropna=True, columns='year')

Unnamed: 0_level_0,year,1973,1984,1993,1994,1995,1996,1997,1998,1999,2000,...,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020
brand,model,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
acura,door,,,,,,,,,,,...,,,,,,,,,,
acura,mdx,,,,,,,,,,,...,,,,16900.0,,,,,,
audi,5,,,,,,,,,,,...,,,,,36400.0,,,,,
audi,door,,,,,,,,,,,...,,,,,,,,,,
audi,q5,,,,,,,,,,,...,,,,,,,19500.0,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
nissan,xd,,,,,,,,,,,...,,,,,,,36300.0,,,
nissan,xterra,,,,,,,,,,,...,,6700.0,,,,,,,,
peterbilt,truck,,,,,,,,,,,...,,287.5,,,,,,,,
ram,door,,,,,,,,,,,...,,,,,,,11050.0,,,
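The link between the "by hand" approach and `pivot_table` can be seen on a small example: grouping by two keys and unstacking one of them gives the same layout as a pivot table with `index` and `columns`. This is a sketch on a made-up frame (`toy`), not on the cars dataset.

```python
import pandas as pd

# Illustrative brand/year/price rows.
toy = pd.DataFrame({
    "brand": ["ford", "ford", "ford", "nissan"],
    "year": [2018, 2018, 2019, 2019],
    "price": [10000, 20000, 30000, 9200],
})

# By hand: group on both keys, then move "year" from the index to the columns.
by_hand = toy.groupby(["brand", "year"])["price"].mean().unstack("year")

# The same table with pivot_table.
pivoted = toy.pivot_table("price", index="brand", columns="year", aggfunc="mean")

print(by_hand)
print(pivoted)
```

Combinations that never occur in the data, such as a 2018 row for nissan here, appear as NaN in both results.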


# Operations <a id="operations"></a>

- Index operations
- Column operations
- Row operations

## Index Operations

In [58]:
# Renaming index labels.
# rename() returns a new DataFrame and leaves the original unchanged, so the result has to be assigned.
# In this example we assign it back to the existing `df` DataFrame.
# Alternatively, pass inplace=True to modify the DataFrame in place instead of assigning the result.
df = df.rename(
    index={
        'United States': 'USA',
        'United Kingdom': 'UK',
        'Argentina': 'AR'  # Labels that don't exist in the index are silently ignored.
    }
)
df

Unnamed: 0,Country,Population,GDP,Surface Area,HDI,Continent
Canada,Canada,35.467,1785387,9984670,0.913,America
France,France,63.951,2833687,640679,0.888,Europe
Germany,Germany,83.94,3874437,357114,0.916,Europe
Italy,Italy,60.665,2167744,301336,0.873,Europe
Japan,Japan,127.061,4602367,377930,0.891,Asia
UK,United Kingdom,64.511,2950039,242495,0.907,Europe
USA,United States,318.523,17348075,9525067,0.915,America


In [59]:
# Another example using string operations.
df = df.rename(index=str.upper)
df

Unnamed: 0,Country,Population,GDP,Surface Area,HDI,Continent
CANADA,Canada,35.467,1785387,9984670,0.913,America
FRANCE,France,63.951,2833687,640679,0.888,Europe
GERMANY,Germany,83.94,3874437,357114,0.916,Europe
ITALY,Italy,60.665,2167744,301336,0.873,Europe
JAPAN,Japan,127.061,4602367,377930,0.891,Asia
UK,United Kingdom,64.511,2950039,242495,0.907,Europe
USA,United States,318.523,17348075,9525067,0.915,America


In [60]:
# Replace the index with auto-incrementing integers.
# By default the previous index is kept as a new column named "index"; drop=True discards it instead.
# reset_index() returns a new DataFrame, so the result has to be assigned.
# Alternatively, pass inplace=True to modify the DataFrame in place instead of assigning the result.
df = df.reset_index(drop=True)
df

Unnamed: 0,Country,Population,GDP,Surface Area,HDI,Continent
0,Canada,35.467,1785387,9984670,0.913,America
1,France,63.951,2833687,640679,0.888,Europe
2,Germany,83.94,3874437,357114,0.916,Europe
3,Italy,60.665,2167744,301336,0.873,Europe
4,Japan,127.061,4602367,377930,0.891,Asia
5,United Kingdom,64.511,2950039,242495,0.907,Europe
6,United States,318.523,17348075,9525067,0.915,America


In [61]:
# Use an existing column as the index of the DataFrame.
# set_index() returns a new DataFrame, so the result has to be assigned to see the change.
# Alternatively, pass inplace=True to modify the DataFrame in place instead of assigning the result.
df = df.set_index('Country')
df

Unnamed: 0_level_0,Population,GDP,Surface Area,HDI,Continent
Country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Canada,35.467,1785387,9984670,0.913,America
France,63.951,2833687,640679,0.888,Europe
Germany,83.94,3874437,357114,0.916,Europe
Italy,60.665,2167744,301336,0.873,Europe
Japan,127.061,4602367,377930,0.891,Asia
United Kingdom,64.511,2950039,242495,0.907,Europe
United States,318.523,17348075,9525067,0.915,America


## Column Operations

In [62]:
# Renaming columns. rename() returns a new DataFrame, so the result is assigned back to df.
df = df.rename(
    columns={
        'HDI': 'Human Development Index',
        'Annual Popcorn Consumption': 'APC' # Column names that don't exist are silently ignored.
    }
)
df

Unnamed: 0_level_0,Population,GDP,Surface Area,Human Development Index,Continent
Country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Canada,35.467,1785387,9984670,0.913,America
France,63.951,2833687,640679,0.888,Europe
Germany,83.94,3874437,357114,0.916,Europe
Italy,60.665,2167744,301336,0.873,Europe
Japan,127.061,4602367,377930,0.891,Asia
United Kingdom,64.511,2950039,242495,0.907,Europe
United States,318.523,17348075,9525067,0.915,America


In [63]:
# Dropping specific columns. drop() returns a new DataFrame; df itself is unchanged.
df.drop(columns=['Population', 'Human Development Index'])
#df.drop(['Population', 'Human Development Index'], axis=1) -> same as above
#df.drop(['Population', 'Human Development Index'], axis='columns') -> same as above

Unnamed: 0_level_0,GDP,Surface Area,Continent
Country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Canada,1785387,9984670,America
France,2833687,640679,Europe
Germany,3874437,357114,Europe
Italy,2167744,301336,Europe
Japan,4602367,377930,Asia
United Kingdom,2950039,242495,Europe
United States,17348075,9525067,America


In [64]:
# Adding a new column 
langs = pd.Series(
    ['French', 'German', 'Italian'],
    index=['France', 'Germany', 'Italy'],
    name='Language')
df['Languages'] = langs
df

Unnamed: 0_level_0,Population,GDP,Surface Area,Human Development Index,Continent,Languages
Country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Canada,35.467,1785387,9984670,0.913,America,
France,63.951,2833687,640679,0.888,Europe,French
Germany,83.94,3874437,357114,0.916,Europe,German
Italy,60.665,2167744,301336,0.873,Europe,Italian
Japan,127.061,4602367,377930,0.891,Asia,
United Kingdom,64.511,2950039,242495,0.907,Europe,
United States,318.523,17348075,9525067,0.915,America,


In [65]:
# Creating columns from other columns
df['GDP Per Capita'] = df['GDP'] / df['Population']
df

Unnamed: 0_level_0,Population,GDP,Surface Area,Human Development Index,Continent,Languages,GDP Per Capita
Country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Canada,35.467,1785387,9984670,0.913,America,,50339.385908
France,63.951,2833687,640679,0.888,Europe,French,44310.284437
Germany,83.94,3874437,357114,0.916,Europe,German,46157.219442
Italy,60.665,2167744,301336,0.873,Europe,Italian,35733.025633
Japan,127.061,4602367,377930,0.891,Asia,,36221.712406
United Kingdom,64.511,2950039,242495,0.907,Europe,,45729.239975
United States,318.523,17348075,9525067,0.915,America,,54464.12033


In [66]:
df[['Population', 'GDP']] / 100

Unnamed: 0_level_0,Population,GDP
Country,Unnamed: 1_level_1,Unnamed: 2_level_1
Canada,0.35467,17853.87
France,0.63951,28336.87
Germany,0.8394,38744.37
Italy,0.60665,21677.44
Japan,1.27061,46023.67
United Kingdom,0.64511,29500.39
United States,3.18523,173480.75


## Row Operations

In [67]:
# Dropping a specific row.
# drop() returns a new DataFrame; df itself is unchanged.
# Alternatively, pass inplace=True to modify df in place instead of assigning the result.
df.drop('Canada')
#df.drop(['Canada', 'Japan']) -> drops multiple rows
#df.drop(['Canada', 'Japan'], axis=0) -> same as the line above
#df.drop(['Canada', 'Japan'], axis='index') -> same as the line above

Unnamed: 0_level_0,Population,GDP,Surface Area,Human Development Index,Continent,Languages,GDP Per Capita
Country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
France,63.951,2833687,640679,0.888,Europe,French,44310.284437
Germany,83.94,3874437,357114,0.916,Europe,German,46157.219442
Italy,60.665,2167744,301336,0.873,Europe,Italian,35733.025633
Japan,127.061,4602367,377930,0.891,Asia,,36221.712406
United Kingdom,64.511,2950039,242495,0.907,Europe,,45729.239975
United States,318.523,17348075,9525067,0.915,America,,54464.12033


In [68]:
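# Adding a new row to the DataFrame. Columns missing from the Series are filled with NaN.
# Note: the other rows store Population in millions, while this value is an absolute count.
df.loc['China'] = pd.Series({'Population': 1_400_000_000, 'Continent': 'Asia'})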
df

Unnamed: 0_level_0,Population,GDP,Surface Area,Human Development Index,Continent,Languages,GDP Per Capita
Country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Canada,35.467,1785387.0,9984670.0,0.913,America,,50339.385908
France,63.951,2833687.0,640679.0,0.888,Europe,French,44310.284437
Germany,83.94,3874437.0,357114.0,0.916,Europe,German,46157.219442
Italy,60.665,2167744.0,301336.0,0.873,Europe,Italian,35733.025633
Japan,127.061,4602367.0,377930.0,0.891,Asia,,36221.712406
United Kingdom,64.511,2950039.0,242495.0,0.907,Europe,,45729.239975
United States,318.523,17348075.0,9525067.0,0.915,America,,54464.12033
China,1400000000.0,,,,Asia,,
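When several rows need to be added at once, a common alternative to assigning through `loc` is to build a second DataFrame and combine the two with `pd.concat`. A minimal sketch on made-up frames (`toy` and `new_rows`):

```python
import pandas as pd

# Existing data.
toy = pd.DataFrame(
    {"Population": [35.467], "Continent": ["America"]},
    index=["Canada"],
)

# Rows to add, using the same columns.
new_rows = pd.DataFrame(
    {"Population": [127.061], "Continent": ["Asia"]},
    index=["Japan"],
)

# concat stacks the frames vertically and returns a new DataFrame.
combined = pd.concat([toy, new_rows])
print(combined)
```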


# Conclusion <a id="conclusion"></a>

We covered some basic functionality of the pandas library for Python. The capabilities of pandas are not limited to the functions explained in this study. Please refer to the [API documentation](https://pandas.pydata.org/docs/reference/index.html) for more information and API updates.