
# **Pandas: Data Science Python Package**

## 1) **Pandas:** Series
## 2) **Pandas:** Dataframes
## 3) **Pandas:** Reading CSV And Basic Plotting

In [4]:
import pandas as pd
import numpy as np

# **Pandas Series**
**Example Dataset Used:** The `"Group of Seven"` (G7) is a political forum formed by Canada, France, Germany, Italy, Japan, the United Kingdom and the United States. The examples below analyze the population of these countries using a `pandas.Series` object.

In [5]:
# Population Of The Group Of Seven In Millions
g7_pop = pd.Series([35.467, 63.951, 80.940, 60.665, 127.061, 64.511, 318.523])
g7_pop.name = "G7 Population In Millions"
print(g7_pop)
# Series - Ordered And Indexed Sequence Of Elements. Underlying Data Structure -> NumPy Array

0     35.467
1     63.951
2     80.940
3     60.665
4    127.061
5     64.511
6    318.523
Name: G7 Population In Millions, dtype: float64


In [6]:
print(type(g7_pop.values))
print(g7_pop.index)

<class 'numpy.ndarray'>
RangeIndex(start=0, stop=7, step=1)


In [7]:
# Customized Indexes In A Pandas Series (g7_pop.index)
g7_pop.index = [
    'Canada',
    'France',
    'Germany',
    'Italy',
    'Japan',
    'United Kingdom',
    'United States',
]
print(g7_pop)

Canada             35.467
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: G7 Population In Millions, dtype: float64


In [8]:
# Create A Pandas Series From A Dictionary: The Keys Become The Index. A Series Behaves Like An "Ordered Dictionary" (Plain Dictionaries Only Guarantee Insertion Order From Python 3.7 Onward)
pd.Series({
    'Canada': 35.467,
    'France': 63.951,
    'Germany': 80.94,
    'Italy': 60.665,
    'Japan': 127.061,
    'United Kingdom': 64.511,
    'United States': 318.523
}, name='G7 Population in millions')

Canada             35.467
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: G7 Population in millions, dtype: float64
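
When a Series is built from a dictionary, an explicit `index` can also be passed alongside it. The following is a minimal sketch of that behaviour; the shortened `pop_dict`, the `wanted` list and the label `'Spain'` are illustrative assumptions, not part of the original dataset.

```python
import pandas as pd

# Illustrative subset of the G7 population data (in millions)
pop_dict = {'Canada': 35.467, 'France': 63.951, 'Germany': 80.940}

# Hypothetical label list: 'Spain' is deliberately absent from pop_dict
wanted = ['Germany', 'Canada', 'Spain']

subset = pd.Series(pop_dict, index=wanted, name='G7 Population In Millions')
print(subset)
# 'France' is dropped because it is not listed in `wanted`;
# 'Spain' is kept with a NaN value because pop_dict has no entry for it.
```

The resulting order follows `wanted`, not the dictionary, which is useful when a fixed row order is needed.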

## **Indexing**

In [9]:
# Indexing works similarly to lists and dictionaries: you use the index label of the element you're looking for.
print(g7_pop['Canada'], g7_pop['Japan'])
# Numeric positions can also be used, with the 'iloc' attribute:
print(g7_pop.iloc[-1], g7_pop.iloc[0]) # (Locate By Sequential Position in the "Ordered Dictionary")
# Multiple Elements Selection / Sequential Multi-Indexing
print(g7_pop[['Italy', 'France']])

35.467 127.061
318.523 35.467
Italy     60.665
France    63.951
Name: G7 Population In Millions, dtype: float64


In [10]:
print(g7_pop['Canada': 'Italy'], "\n") # Upper Limit Is Also Included In The Slicing, Unlike Vanilla Python
print(g7_pop.iloc[[0, 1]])

Canada     35.467
France     63.951
Germany    80.940
Italy      60.665
Name: G7 Population In Millions, dtype: float64 

Canada    35.467
France    63.951
Name: G7 Population In Millions, dtype: float64
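
The difference between label-based and position-based slicing is easy to miss. As a small sketch on a shortened copy of the population data (the variable names `by_label` and `by_position` are illustrative), the two styles treat the end of the slice differently:

```python
import pandas as pd

s = pd.Series(
    [35.467, 63.951, 80.940, 60.665],
    index=['Canada', 'France', 'Germany', 'Italy'],
)

by_label = s.loc['Canada':'Germany']   # label slice: 'Germany' IS included
by_position = s.iloc[0:2]              # positional slice: position 2 is NOT included

print(by_label.index.tolist())
print(by_position.index.tolist())
```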


## **Conditional Selection (Boolean Series)**

In [11]:
print(g7_pop[g7_pop > g7_pop.mean()], "\n")
print(g7_pop[g7_pop > 70], "\n")
print(g7_pop * 1_000_000, "\n")
print(g7_pop.mean())

Japan            127.061
United States    318.523
Name: G7 Population In Millions, dtype: float64 

Germany           80.940
Japan            127.061
United States    318.523
Name: G7 Population In Millions, dtype: float64 

Canada             35467000.0
France             63951000.0
Germany            80940000.0
Italy              60665000.0
Japan             127061000.0
United Kingdom     64511000.0
United States     318523000.0
Name: G7 Population In Millions, dtype: float64 

107.30257142857144


In [12]:
print(g7_pop['France': 'Italy'].mean(), "\n")
print(g7_pop[(g7_pop > 80) | (g7_pop < 40)], "\n")
print(g7_pop[(g7_pop > 80) & (g7_pop < 200)], "\n")

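# Note: The Second Condition Is Redundant Here, Since Any Value Above (mean + std/2) Is Also Above (mean - std/2)
print(g7_pop[(g7_pop > g7_pop.mean() - g7_pop.std() / 2) | (g7_pop > g7_pop.mean() + g7_pop.std() / 2)])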

68.51866666666666 

Canada            35.467
Germany           80.940
Japan            127.061
United States    318.523
Name: G7 Population In Millions, dtype: float64 

Germany     80.940
Japan      127.061
Name: G7 Population In Millions, dtype: float64 

France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: G7 Population In Millions, dtype: float64
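
The last filter in the cell above joins two "greater than" conditions with `|`, so only the first one has any effect. If the aim were to keep the values lying more than half a standard deviation away from the mean on either side (an assumption about the intent, not something the notebook states), a sketch could look like this:

```python
import pandas as pd

g7 = pd.Series(
    [35.467, 63.951, 80.940, 60.665, 127.061, 64.511, 318.523],
    index=['Canada', 'France', 'Germany', 'Italy', 'Japan',
           'United Kingdom', 'United States'],
)

lower = g7.mean() - g7.std() / 2
upper = g7.mean() + g7.std() / 2

# Keep values below the lower bound OR above the upper bound
outside = g7[(g7 < lower) | (g7 > upper)]

# Equivalent formulation: negate an inclusive "between" mask with ~
outside_alt = g7[~g7.between(lower, upper)]

print(outside)
```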


## **Series Modification**

In [13]:
g7_pop[g7_pop < 70] = 99.99
g7_pop.iloc[-1] = 500
g7_pop['Canada'] = 40.5
print(g7_pop)

Canada             40.500
France             99.990
Germany            80.940
Italy              99.990
Japan             127.061
United Kingdom     99.990
United States     500.000
Name: G7 Population In Millions, dtype: float64
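
Assignments like the ones above change the Series itself. When the original values are still needed, one option is to work on a copy or to use a non-mutating method. This is a sketch on a shortened Series; the names `original`, `working` and `replaced` are illustrative.

```python
import pandas as pd

original = pd.Series(
    [35.467, 63.951, 80.940],
    index=['Canada', 'France', 'Germany'],
)

# Work on an explicit copy so the source Series stays intact
working = original.copy()
working[working < 70] = 99.99

# where() is a non-mutating alternative: it keeps values where the
# condition holds and substitutes 99.99 everywhere else
replaced = original.where(original >= 70, 99.99)

print(original)
print(replaced)
```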


# **Pandas Dataframes**

In [14]:
# Usually Created From ".csv" Files Or By Scraping Data From The Web
# Pandas Dataframes -> Collection Of Pandas Series (One Per Column) Sharing A Common Index
df = pd.DataFrame({
    'Population': [35.467, 63.951, 80.94 , 60.665, 127.061, 64.511, 318.523],
    'GDP': [
        1785387,
        2833687,
        3874437,
        2167744,
        4602367,
        2950039,
        17348075
    ],
    'Surface Area': [
        9984670,
        640679,
        357114,
        301336,
        377930,
        242495,
        9525067
    ],
    'HDI': [
        0.913,
        0.888,
        0.916,
        0.873,
        0.891,
        0.907,
        0.915
    ],
    'Continent': [
        'America',
        'Europe',
        'Europe',
        'Europe',
        'Asia',
        'Europe',
        'America'
    ]
}, columns=['Population', 'GDP', 'Surface Area', 'HDI', 'Continent'])

In [15]:
df

| | Population | GDP | Surface Area | HDI | Continent |
|---|---|---|---|---|---|
| 0 | 35.467 | 1785387 | 9984670 | 0.913 | America |
| 1 | 63.951 | 2833687 | 640679 | 0.888 | Europe |
| 2 | 80.94 | 3874437 | 357114 | 0.916 | Europe |
| 3 | 60.665 | 2167744 | 301336 | 0.873 | Europe |
| 4 | 127.061 | 4602367 | 377930 | 0.891 | Asia |
| 5 | 64.511 | 2950039 | 242495 | 0.907 | Europe |
| 6 | 318.523 | 17348075 | 9525067 | 0.915 | America |


In [16]:
# Utility Functions And Useful Attributes
print(df.index, "\n")
print(df.columns, "\n")
print(df.info(), "\n")
print(df.size)
print(df.shape, "\n")
print(df.describe(), "\n") # Statistical Analysis Method
print(df.dtypes, "\n")
print(df.dtypes.value_counts())

RangeIndex(start=0, stop=7, step=1) 

Index(['Population', 'GDP', 'Surface Area', 'HDI', 'Continent'], dtype='object') 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Population    7 non-null      float64
 1   GDP           7 non-null      int64  
 2   Surface Area  7 non-null      int64  
 3   HDI           7 non-null      float64
 4   Continent     7 non-null      object 
dtypes: float64(2), int64(2), object(1)
memory usage: 408.0+ bytes
None 

35
(7, 5) 

       Population           GDP  Surface Area       HDI
count    7.000000  7.000000e+00  7.000000e+00  7.000000
mean   107.302571  5.080248e+06  3.061327e+06  0.900429
std     97.249970  5.494020e+06  4.576187e+06  0.016592
min     35.467000  1.785387e+06  2.424950e+05  0.873000
25%     62.308000  2.500716e+06  3.292250e+05  0.889500
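50%     64.511000  2.950039e+06  3.779300e+05  0.907000
75%    104.000500  4.238402e+06  5.082873e+06  0.914000
max    318.523000  1.734808e+07  9.984670e+06  0.916000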


## **Indexing, Selection And Slicing In Dataframes**

In [17]:
df.index = [
    'Canada',
    'France',
    'Germany',
    'Italy',
    'Japan',
    'United Kingdom',
    'United States',
]

In [18]:
df

| | Population | GDP | Surface Area | HDI | Continent |
|---|---|---|---|---|---|
| Canada | 35.467 | 1785387 | 9984670 | 0.913 | America |
| France | 63.951 | 2833687 | 640679 | 0.888 | Europe |
| Germany | 80.94 | 3874437 | 357114 | 0.916 | Europe |
| Italy | 60.665 | 2167744 | 301336 | 0.873 | Europe |
| Japan | 127.061 | 4602367 | 377930 | 0.891 | Asia |
| United Kingdom | 64.511 | 2950039 | 242495 | 0.907 | Europe |
| United States | 318.523 | 17348075 | 9525067 | 0.915 | America |


In [19]:
# Selecting A Single Row With .loc[] Or .iloc[] Returns A Pandas Series (The Row Label Becomes The Series "Name")
# These Selections Do Not Modify The Underlying DataFrame; They Return New Objects

print(df.loc['Canada'], "\n") # Select By Index -> df.loc[] (Horizontal)
print(df.iloc[-1], "\n") # Select By Sequential Position -> df.iloc[] (Horizontal)
print(df['Population'], "\n") # Give The Entire Column Data (Vertical)

print(df.loc['France': 'Italy'], "\n")
print(df.loc['France': 'Italy', 'Population'], "\n")
print(df.iloc[[0, 1, -1]], "\n") #Selects Sequential Positions 0, 1 and -1

print(df.iloc[1:3, 3])

Population       35.467
GDP             1785387
Surface Area    9984670
HDI               0.913
Continent       America
Name: Canada, dtype: object 

Population       318.523
GDP             17348075
Surface Area     9525067
HDI                0.915
Continent        America
Name: United States, dtype: object 

Canada             35.467
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: Population, dtype: float64 

         Population      GDP  Surface Area    HDI Continent
France       63.951  2833687        640679  0.888    Europe
Germany      80.940  3874437        357114  0.916    Europe
Italy        60.665  2167744        301336  0.873    Europe 

France     63.951
Germany    80.940
Italy      60.665
Name: Population, dtype: float64 

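               Population       GDP  Surface Area    HDI Continent
Canada             35.467   1785387       9984670  0.913   America
France             63.951   2833687        640679  0.888    Europe
United States     318.523  17348075       9525067  0.915   America 

France     0.888
Germany    0.916
Name: HDI, dtype: float64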

In [20]:
df.loc['France': 'Italy', ['Population', 'GDP']] # Selection Of Columns By Index/Name

| | Population | GDP |
|---|---|---|
| France | 63.951 | 2833687 |
| Germany | 80.94 | 3874437 |
| Italy | 60.665 | 2167744 |


In [21]:
df.iloc[1:3, [0, 3]] # Selection Of Columns By Sequential Position

| | Population | HDI |
|---|---|---|
| France | 63.951 | 0.888 |
| Germany | 80.94 | 0.916 |


In [22]:
df.iloc[1:3, 1:3] # Upper Limit Is Excluded With .iloc (Unlike Label-Based .loc Slicing)

| | GDP | Surface Area |
|---|---|---|
| France | 2833687 | 640679 |
| Germany | 3874437 | 357114 |


## **Conditional Selection (Boolean Arrays)**

In [23]:
df

| | Population | GDP | Surface Area | HDI | Continent |
|---|---|---|---|---|---|
| Canada | 35.467 | 1785387 | 9984670 | 0.913 | America |
| France | 63.951 | 2833687 | 640679 | 0.888 | Europe |
| Germany | 80.94 | 3874437 | 357114 | 0.916 | Europe |
| Italy | 60.665 | 2167744 | 301336 | 0.873 | Europe |
| Japan | 127.061 | 4602367 | 377930 | 0.891 | Asia |
| United Kingdom | 64.511 | 2950039 | 242495 | 0.907 | Europe |
| United States | 318.523 | 17348075 | 9525067 | 0.915 | America |


In [24]:
print(df['Population'] > 70, "\n") # Mask Series
print(df.loc[df['Population'] > 70, ["Population", "GDP"]])

Canada            False
France            False
Germany            True
Italy             False
Japan              True
United Kingdom    False
United States      True
Name: Population, dtype: bool 

               Population       GDP
Germany            80.940   3874437
Japan             127.061   4602367
United States     318.523  17348075


In [25]:
# Delete Existing Entries By Index Name/Value
print(df.drop('Canada'), "\n") # Return Value Is The Modified Dataframe Without The Deleted Row/Record
print(df.drop(['Canada', 'Japan']), "\n")
# Method-2
print(df.drop(['Italy', 'Canada'], axis=0))

                Population       GDP  Surface Area    HDI Continent
France              63.951   2833687        640679  0.888    Europe
Germany             80.940   3874437        357114  0.916    Europe
Italy               60.665   2167744        301336  0.873    Europe
Japan              127.061   4602367        377930  0.891      Asia
United Kingdom      64.511   2950039        242495  0.907    Europe
United States      318.523  17348075       9525067  0.915   America 

                Population       GDP  Surface Area    HDI Continent
France              63.951   2833687        640679  0.888    Europe
Germany             80.940   3874437        357114  0.916    Europe
Italy               60.665   2167744        301336  0.873    Europe
United Kingdom      64.511   2950039        242495  0.907    Europe
United States      318.523  17348075       9525067  0.915   America 

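                Population       GDP  Surface Area    HDI Continent
France              63.951   2833687        640679  0.888    Europe
Germany             80.940   3874437        357114  0.916    Europe
Japan              127.061   4602367        377930  0.891      Asia
United Kingdom      64.511   2950039        242495  0.907    Europe
United States      318.523  17348075       9525067  0.915   America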

In [26]:
# Delete Attributes/Columns By Column Name
print(df.drop(columns=['Population', 'HDI']), "\n")
print(df.drop(['Population', 'HDI'], axis=1))

                     GDP  Surface Area Continent
Canada           1785387       9984670   America
France           2833687        640679    Europe
Germany          3874437        357114    Europe
Italy            2167744        301336    Europe
Japan            4602367        377930      Asia
United Kingdom   2950039        242495    Europe
United States   17348075       9525067   America 

                     GDP  Surface Area Continent
Canada           1785387       9984670   America
France           2833687        640679    Europe
Germany          3874437        357114    Europe
Italy            2167744        301336    Europe
Japan            4602367        377930      Asia
United Kingdom   2950039        242495    Europe
United States   17348075       9525067   America


In [27]:
# Method-3: Delete Attributes/Columns By Column Name
df.drop(['Population', 'HDI'], axis='columns')

| | GDP | Surface Area | Continent |
|---|---|---|---|
| Canada | 1785387 | 9984670 | America |
| France | 2833687 | 640679 | Europe |
| Germany | 3874437 | 357114 | Europe |
| Italy | 2167744 | 301336 | Europe |
| Japan | 4602367 | 377930 | Asia |
| United Kingdom | 2950039 | 242495 | Europe |
| United States | 17348075 | 9525067 | America |


In [28]:
# Method-3: Delete Existing Entries By Index Name/Value
df.drop(['Canada', 'Germany'], axis='rows')

| | Population | GDP | Surface Area | HDI | Continent |
|---|---|---|---|---|---|
| France | 63.951 | 2833687 | 640679 | 0.888 | Europe |
| Italy | 60.665 | 2167744 | 301336 | 0.873 | Europe |
| Japan | 127.061 | 4602367 | 377930 | 0.891 | Asia |
| United Kingdom | 64.511 | 2950039 | 242495 | 0.907 | Europe |
| United States | 318.523 | 17348075 | 9525067 | 0.915 | America |


All these drop methods return a new DataFrame. To modify the DataFrame `"in place"` instead, pass `inplace=True` to the `.drop()` method.
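
As a small sketch of this behaviour (the two-row `df_demo` and the label `'Atlantis'` are made up for illustration), the original DataFrame is untouched unless the result is assigned back:

```python
import pandas as pd

df_demo = pd.DataFrame(
    {'Population': [35.467, 63.951], 'HDI': [0.913, 0.888]},
    index=['Canada', 'France'],
)

trimmed = df_demo.drop(columns='HDI')   # returns a new DataFrame
print(df_demo.columns.tolist())         # the original still has both columns

# Reassigning the result is a common alternative to inplace=True
df_demo = df_demo.drop(index='Canada')
print(df_demo.index.tolist())

# errors='ignore' skips labels that do not exist instead of raising KeyError
same = df_demo.drop(index='Atlantis', errors='ignore')
print(same.equals(df_demo))
```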

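Arithmetic between a DataFrame and a Pandas Series works at a `column level`: the Series index is matched against the DataFrame's column labels and the operation is `broadcast down the rows`. Arithmetic with a scalar is applied to every element.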

In [29]:
df[['Population', 'GDP']] / 100

| | Population | GDP |
|---|---|---|
| Canada | 0.35467 | 17853.87 |
| France | 0.63951 | 28336.87 |
| Germany | 0.8094 | 38744.37 |
| Italy | 0.60665 | 21677.44 |
| Japan | 1.27061 | 46023.67 |
| United Kingdom | 0.64511 | 29500.39 |
| United States | 3.18523 | 173480.75 |


In [30]:
# Broadcasting Operations Between Series And Dataframes
crisis = pd.Series([-1_000_000, -0.3], index=['GDP', 'HDI'])
crisis

GDP   -1000000.0
HDI         -0.3
dtype: float64

In [31]:
df[['GDP', 'HDI']] + crisis # Operations With Pandas Series Work At A Column-Level, Broadcasting Down The Rows

| | GDP | HDI |
|---|---|---|
| Canada | 785387.0 | 0.613 |
| France | 1833687.0 | 0.588 |
| Germany | 2874437.0 | 0.616 |
| Italy | 1167744.0 | 0.573 |
| Japan | 3602367.0 | 0.591 |
| United Kingdom | 1950039.0 | 0.607 |
| United States | 16348075.0 | 0.615 |
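
The `+` operator matches the Series index against the column labels. To match against the row labels instead, the arithmetic methods such as `.mul()` accept an `axis` argument. The sketch below assumes a made-up `growth` Series of per-country multipliers; it is not part of the original data.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {'GDP': [1785387, 2833687], 'HDI': [0.913, 0.888]},
    index=['Canada', 'France'],
)

# Made-up per-country multipliers, indexed by ROW label
growth = pd.Series([1.10, 0.90], index=['Canada', 'France'])

# axis=0 (or axis='index') aligns `growth` with the row labels
scaled = df_demo[['GDP']].mul(growth, axis=0)
print(scaled)

# The default alignment (on columns) finds no matching labels here,
# so every cell of the result is NaN
mismatched = df_demo[['GDP']] * growth
print(mismatched)
```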


## **Modifying Dataframes**

## 1) Adding A New Column To A Dataframe
## 2) Replacing Values Per Column
## 3) Renaming Columns
## 4) Dropping Columns
## 5) Adding Rows/Entries
## 6) More Radical Index Changes

In [32]:
# Task: Add Column 'Language' To The Existing Dataframe, With Values Assigned For France, Germany And Italy
langs = pd.Series(
    ['French', 'German', 'Italian'],
    index=['France', 'Germany', 'Italy'],
    name='Language'
)
langs

France      French
Germany     German
Italy      Italian
Name: Language, dtype: object

In [33]:
df['Language'] = langs
df # Unassigned Countries Default to NaN

| | Population | GDP | Surface Area | HDI | Continent | Language |
|---|---|---|---|---|---|---|
| Canada | 35.467 | 1785387 | 9984670 | 0.913 | America | NaN |
| France | 63.951 | 2833687 | 640679 | 0.888 | Europe | French |
| Germany | 80.94 | 3874437 | 357114 | 0.916 | Europe | German |
| Italy | 60.665 | 2167744 | 301336 | 0.873 | Europe | Italian |
| Japan | 127.061 | 4602367 | 377930 | 0.891 | Asia | NaN |
| United Kingdom | 64.511 | 2950039 | 242495 | 0.907 | Europe | NaN |
| United States | 318.523 | 17348075 | 9525067 | 0.915 | America | NaN |


In [34]:
df['Language'] = 'English'
df

| | Population | GDP | Surface Area | HDI | Continent | Language |
|---|---|---|---|---|---|---|
| Canada | 35.467 | 1785387 | 9984670 | 0.913 | America | English |
| France | 63.951 | 2833687 | 640679 | 0.888 | Europe | English |
| Germany | 80.94 | 3874437 | 357114 | 0.916 | Europe | English |
| Italy | 60.665 | 2167744 | 301336 | 0.873 | Europe | English |
| Japan | 127.061 | 4602367 | 377930 | 0.891 | Asia | English |
| United Kingdom | 64.511 | 2950039 | 242495 | 0.907 | Europe | English |
| United States | 318.523 | 17348075 | 9525067 | 0.915 | America | English |


In [35]:
df.rename(
    columns={
        'HDI': 'Human Development Index', # Present
        'Annual Popcorn Consumption': 'APC' # Absent
    }, 
    index={
        'United States': 'USA', # Present
        'United Kingdom': 'UK', # Present
        'Argentina': 'AR' # Absent
    }
) # Labels That Are Not Present Are Silently Ignored; rename() Returns A New DataFrame

| | Population | GDP | Surface Area | Human Development Index | Continent | Language |
|---|---|---|---|---|---|---|
| Canada | 35.467 | 1785387 | 9984670 | 0.913 | America | English |
| France | 63.951 | 2833687 | 640679 | 0.888 | Europe | English |
| Germany | 80.94 | 3874437 | 357114 | 0.916 | Europe | English |
| Italy | 60.665 | 2167744 | 301336 | 0.873 | Europe | English |
| Japan | 127.061 | 4602367 | 377930 | 0.891 | Asia | English |
| UK | 64.511 | 2950039 | 242495 | 0.907 | Europe | English |
| USA | 318.523 | 17348075 | 9525067 | 0.915 | America | English |
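
When silently skipping a missing label would hide a typo, `rename()` also accepts `errors='raise'`. A minimal sketch on a two-row frame (the name `df_demo` and the flag `raised` are illustrative):

```python
import pandas as pd

df_demo = pd.DataFrame({'HDI': [0.913, 0.915]}, index=['Canada', 'United States'])

# Missing labels are ignored by default
renamed = df_demo.rename(columns={
    'HDI': 'Human Development Index',      # present
    'Annual Popcorn Consumption': 'APC',   # absent, silently skipped
})
print(renamed.columns.tolist())

# errors='raise' turns a missing label into a KeyError
try:
    df_demo.rename(index={'Argentina': 'AR'}, errors='raise')
    raised = False
except KeyError:
    raised = True
print(raised)

# The original DataFrame keeps its labels because rename() returned a copy
print(df_demo.columns.tolist())
```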


In [36]:
df.rename(index=str.upper)

| | Population | GDP | Surface Area | HDI | Continent | Language |
|---|---|---|---|---|---|---|
| CANADA | 35.467 | 1785387 | 9984670 | 0.913 | America | English |
| FRANCE | 63.951 | 2833687 | 640679 | 0.888 | Europe | English |
| GERMANY | 80.94 | 3874437 | 357114 | 0.916 | Europe | English |
| ITALY | 60.665 | 2167744 | 301336 | 0.873 | Europe | English |
| JAPAN | 127.061 | 4602367 | 377930 | 0.891 | Asia | English |
| UNITED KINGDOM | 64.511 | 2950039 | 242495 | 0.907 | Europe | English |
| UNITED STATES | 318.523 | 17348075 | 9525067 | 0.915 | America | English |


In [37]:
df.rename(index = lambda x: x.lower())

| | Population | GDP | Surface Area | HDI | Continent | Language |
|---|---|---|---|---|---|---|
| canada | 35.467 | 1785387 | 9984670 | 0.913 | America | English |
| france | 63.951 | 2833687 | 640679 | 0.888 | Europe | English |
| germany | 80.94 | 3874437 | 357114 | 0.916 | Europe | English |
| italy | 60.665 | 2167744 | 301336 | 0.873 | Europe | English |
| japan | 127.061 | 4602367 | 377930 | 0.891 | Asia | English |
| united kingdom | 64.511 | 2950039 | 242495 | 0.907 | Europe | English |
| united states | 318.523 | 17348075 | 9525067 | 0.915 | America | English |


In [38]:
df.drop(columns='Language', inplace=True)

In [39]:
df

| | Population | GDP | Surface Area | HDI | Continent |
|---|---|---|---|---|---|
| Canada | 35.467 | 1785387 | 9984670 | 0.913 | America |
| France | 63.951 | 2833687 | 640679 | 0.888 | Europe |
| Germany | 80.94 | 3874437 | 357114 | 0.916 | Europe |
| Italy | 60.665 | 2167744 | 301336 | 0.873 | Europe |
| Japan | 127.061 | 4602367 | 377930 | 0.891 | Asia |
| United Kingdom | 64.511 | 2950039 | 242495 | 0.907 | Europe |
| United States | 318.523 | 17348075 | 9525067 | 0.915 | America |


In [40]:
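# Note: DataFrame.append() Was Deprecated In pandas 1.4 And Removed In pandas 2.0; Newer Versions Need pd.concat() Instead
df.append(pd.Series({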
    'Population': 3,
    'GDP': 5
}, name='China')) # All Unassigned Columns Default To NaN

| | Population | GDP | Surface Area | HDI | Continent |
|---|---|---|---|---|---|
| Canada | 35.467 | 1785387.0 | 9984670.0 | 0.913 | America |
| France | 63.951 | 2833687.0 | 640679.0 | 0.888 | Europe |
| Germany | 80.94 | 3874437.0 | 357114.0 | 0.916 | Europe |
| Italy | 60.665 | 2167744.0 | 301336.0 | 0.873 | Europe |
| Japan | 127.061 | 4602367.0 | 377930.0 | 0.891 | Asia |
| United Kingdom | 64.511 | 2950039.0 | 242495.0 | 0.907 | Europe |
| United States | 318.523 | 17348075.0 | 9525067.0 | 0.915 | America |
| China | 3.0 | 5.0 | NaN | NaN | NaN |
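
On pandas 2.0 and later, where `DataFrame.append()` no longer exists, a row can be added with `pd.concat()`. The sketch below is one possible way to do it on a reduced two-column frame; `df_demo`, `new_row` and `extended` are illustrative names, and the new row supplies a value for every column to keep the example simple.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {'Population': [35.467, 63.951], 'GDP': [1785387, 2833687]},
    index=['Canada', 'France'],
)

# Build the new row as a one-row DataFrame whose index is the row label
new_row = pd.DataFrame({'Population': [3.0], 'GDP': [5]}, index=['China'])

# pd.concat() returns a new DataFrame; df_demo itself is unchanged
extended = pd.concat([df_demo, new_row])
print(extended)
```

As with `append()`, the result has to be assigned to a name to be kept.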


To `directly set` the new index and values to the existing DataFrame:

In [41]:
df.loc['China'] = pd.Series({'Population': 1_400_000_000, 'Continent': 'Asia'})
df

| | Population | GDP | Surface Area | HDI | Continent |
|---|---|---|---|---|---|
| Canada | 35.467 | 1785387.0 | 9984670.0 | 0.913 | America |
| France | 63.951 | 2833687.0 | 640679.0 | 0.888 | Europe |
| Germany | 80.94 | 3874437.0 | 357114.0 | 0.916 | Europe |
| Italy | 60.665 | 2167744.0 | 301336.0 | 0.873 | Europe |
| Japan | 127.061 | 4602367.0 | 377930.0 | 0.891 | Asia |
| United Kingdom | 64.511 | 2950039.0 | 242495.0 | 0.907 | Europe |
| United States | 318.523 | 17348075.0 | 9525067.0 | 0.915 | America |
| China | 1400000000.0 | NaN | NaN | NaN | Asia |


In [42]:
df.drop('China', inplace=True)
df

| | Population | GDP | Surface Area | HDI | Continent |
|---|---|---|---|---|---|
| Canada | 35.467 | 1785387.0 | 9984670.0 | 0.913 | America |
| France | 63.951 | 2833687.0 | 640679.0 | 0.888 | Europe |
| Germany | 80.94 | 3874437.0 | 357114.0 | 0.916 | Europe |
| Italy | 60.665 | 2167744.0 | 301336.0 | 0.873 | Europe |
| Japan | 127.061 | 4602367.0 | 377930.0 | 0.891 | Asia |
| United Kingdom | 64.511 | 2950039.0 | 242495.0 | 0.907 | Europe |
| United States | 318.523 | 17348075.0 | 9525067.0 | 0.915 | America |


In [43]:
# More Radical Index Changes -> .reset_index(), .set_index() -> Return Value Is The Modified Dataframe
df.reset_index()

| | index | Population | GDP | Surface Area | HDI | Continent |
|---|---|---|---|---|---|---|
| 0 | Canada | 35.467 | 1785387.0 | 9984670.0 | 0.913 | America |
| 1 | France | 63.951 | 2833687.0 | 640679.0 | 0.888 | Europe |
| 2 | Germany | 80.94 | 3874437.0 | 357114.0 | 0.916 | Europe |
| 3 | Italy | 60.665 | 2167744.0 | 301336.0 | 0.873 | Europe |
| 4 | Japan | 127.061 | 4602367.0 | 377930.0 | 0.891 | Asia |
| 5 | United Kingdom | 64.511 | 2950039.0 | 242495.0 | 0.907 | Europe |
| 6 | United States | 318.523 | 17348075.0 | 9525067.0 | 0.915 | America |


In [44]:
df.set_index('Population')

| Population | GDP | Surface Area | HDI | Continent |
|---|---|---|---|---|
| 35.467 | 1785387.0 | 9984670.0 | 0.913 | America |
| 63.951 | 2833687.0 | 640679.0 | 0.888 | Europe |
| 80.94 | 3874437.0 | 357114.0 | 0.916 | Europe |
| 60.665 | 2167744.0 | 301336.0 | 0.873 | Europe |
| 127.061 | 4602367.0 | 377930.0 | 0.891 | Asia |
| 64.511 | 2950039.0 | 242495.0 | 0.907 | Europe |
| 318.523 | 17348075.0 | 9525067.0 | 0.915 | America |
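
`reset_index()` and `set_index()` are inverses of each other, which the following sketch illustrates with a round trip. The column name `'Country'` is an illustrative choice; the original notebook leaves the index unnamed, which is why `reset_index()` above produces a column called `index`.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {'Population': [35.467, 63.951], 'Continent': ['America', 'Europe']},
    index=['Canada', 'France'],
)

# Name the index first so reset_index() produces a 'Country' column
# instead of the generic 'index'
flat = df_demo.rename_axis('Country').reset_index()
print(flat.columns.tolist())

# set_index() promotes the column back to the index
restored = flat.set_index('Country')
print(restored.index.tolist())
```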


In [45]:
df

| | Population | GDP | Surface Area | HDI | Continent |
|---|---|---|---|---|---|
| Canada | 35.467 | 1785387.0 | 9984670.0 | 0.913 | America |
| France | 63.951 | 2833687.0 | 640679.0 | 0.888 | Europe |
| Germany | 80.94 | 3874437.0 | 357114.0 | 0.916 | Europe |
| Italy | 60.665 | 2167744.0 | 301336.0 | 0.873 | Europe |
| Japan | 127.061 | 4602367.0 | 377930.0 | 0.891 | Asia |
| United Kingdom | 64.511 | 2950039.0 | 242495.0 | 0.907 | Europe |
| United States | 318.523 | 17348075.0 | 9525067.0 | 0.915 | America |


## **Creating New Columns From Existing Columns**

As part of the `Countries` analysis, the `"GDP Per Capita"` of each country is calculated with the formula `GDP/Population` and stored as a new column.

In [46]:
df['GDP Per Capita'] = df['GDP'] / df['Population']
df

| | Population | GDP | Surface Area | HDI | Continent | GDP Per Capita |
|---|---|---|---|---|---|---|
| Canada | 35.467 | 1785387.0 | 9984670.0 | 0.913 | America | 50339.385908 |
| France | 63.951 | 2833687.0 | 640679.0 | 0.888 | Europe | 44310.284437 |
| Germany | 80.94 | 3874437.0 | 357114.0 | 0.916 | Europe | 47868.013343 |
| Italy | 60.665 | 2167744.0 | 301336.0 | 0.873 | Europe | 35733.025633 |
| Japan | 127.061 | 4602367.0 | 377930.0 | 0.891 | Asia | 36221.712406 |
| United Kingdom | 64.511 | 2950039.0 | 242495.0 | 0.907 | Europe | 45729.239975 |
| United States | 318.523 | 17348075.0 | 9525067.0 | 0.915 | America | 54464.12033 |
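
A derived column can also be created without modifying the DataFrame, which is handy when chaining further steps such as sorting. This is a sketch on three of the seven countries; `df_demo` and `ranked` are illustrative names.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {
        'Population': [35.467, 60.665, 318.523],
        'GDP': [1785387, 2167744, 17348075],
    },
    index=['Canada', 'Italy', 'United States'],
)

# assign() returns a new DataFrame with the extra column; df_demo is unchanged.
# The ** unpacking is needed because the column name contains spaces.
ranked = (
    df_demo
    .assign(**{'GDP Per Capita': df_demo['GDP'] / df_demo['Population']})
    .sort_values('GDP Per Capita', ascending=False)
)
print(ranked)
```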


## **Statistical Information About The DataFrame**

In [47]:
df.head() # First 5 Entries Of The DataFrame

| | Population | GDP | Surface Area | HDI | Continent | GDP Per Capita |
|---|---|---|---|---|---|---|
| Canada | 35.467 | 1785387.0 | 9984670.0 | 0.913 | America | 50339.385908 |
| France | 63.951 | 2833687.0 | 640679.0 | 0.888 | Europe | 44310.284437 |
| Germany | 80.94 | 3874437.0 | 357114.0 | 0.916 | Europe | 47868.013343 |
| Italy | 60.665 | 2167744.0 | 301336.0 | 0.873 | Europe | 35733.025633 |
| Japan | 127.061 | 4602367.0 | 377930.0 | 0.891 | Asia | 36221.712406 |


In [48]:
df.describe() # Statistical Summary Of The DataFrame

| | Population | GDP | Surface Area | HDI | GDP Per Capita |
|---|---|---|---|---|---|
| count | 7.0 | 7.0 | 7.0 | 7.0 | 7.0 |
| mean | 107.302571 | 5080248.0 | 3061327.0 | 0.900429 | 44952.254576 |
| std | 97.24997 | 5494020.0 | 4576187.0 | 0.016592 | 6954.983875 |
| min | 35.467 | 1785387.0 | 242495.0 | 0.873 | 35733.025633 |
| 25% | 62.308 | 2500716.0 | 329225.0 | 0.8895 | 40265.998421 |
| 50% | 64.511 | 2950039.0 | 377930.0 | 0.907 | 45729.239975 |
| 75% | 104.0005 | 4238402.0 | 5082873.0 | 0.914 | 49103.699626 |
| max | 318.523 | 17348080.0 | 9984670.0 | 0.916 | 54464.12033 |


In [49]:
# Columnar Analysis Of Data - Population 
population = df['Population']
print(population.min(), population.max()) # In Millions
print(population.sum())
print(population.sum() / len(population)) # Mean Calculated Manually
print(population.mean()) # Mean Using The .mean() Function
print(population.std()) # Standard Deviation
print(population.median()) # Median Of The Population Column/Attribute

35.467 318.523
751.118
107.30257142857144
107.30257142857144
97.24996987121581
64.511


In [50]:
population.describe() # Summary Statistics Of Population Attribute

count      7.000000
mean     107.302571
std       97.249970
min       35.467000
25%       62.308000
50%       64.511000
75%      104.000500
max      318.523000
Name: Population, dtype: float64

In [51]:
# Quantiles -> Splits The Data Into Equal Parts / Equal Divisions Of Data Points / Equally Sized Groups
# Example: 50% Quantile => Median, Since 50% Of The Data Points Are Above It, And 50% Below.
# 25% Quantile For Example => 25% Of The Data Points Are Less Than It
# Self-Reminder: 0.25 Quantile Is Also Called 25th Percentile In Practice

print(population.quantile(.25), "\n")
print(population.quantile([.2, .4, .6, .8, 1]))
print(population.max())

62.308 

0.2     61.3222
0.4     64.1750
0.6     74.3684
0.8    117.8368
1.0    318.5230
Name: Population, dtype: float64
318.523
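
To make the `0.25` quantile less of a black box, it can be reproduced by hand. The sketch below assumes the default `'linear'` interpolation of `Series.quantile()`; the helper names `ordered`, `pos`, `lo`, `hi` and `manual` are illustrative.

```python
import numpy as np
import pandas as pd

population = pd.Series([35.467, 63.951, 80.940, 60.665, 127.061, 64.511, 318.523])

q = 0.25
ordered = np.sort(population.to_numpy())

# Fractional position of the quantile within the sorted values
pos = q * (len(ordered) - 1)          # 0.25 * 6 = 1.5
lo, hi = int(np.floor(pos)), int(np.ceil(pos))

# Interpolate linearly between the two neighbouring sorted values
manual = ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])

print(manual)
print(population.quantile(q))
```

Position 1.5 falls halfway between the second and third smallest values (60.665 and 63.951), which gives the 62.308 reported above.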
