Pandas
Pandas is a Python tool that helps you work with data easily — like how Excel works with tables.

It lets you:
	•	Read data from files (like CSV or Excel)
	•	Look at it in a table format
	•	Clean it if it’s messy
	•	Sort, filter, and analyze it
	•	Save the clean data again

Pandas is used a lot in data analysis, data science, and machine learning because most real-world data comes in table form — with rows and columns — and Pandas makes working with that kind of data super simple and powerful.



Pandas gives you 2 main tools:
	1.	Series → a single column (like a list with labels)
	2.	DataFrame → a full table (like an Excel sheet)
	•	It works on top of NumPy, meaning it uses NumPy under the hood to do fast calculations but adds labels (column names and row numbers) so it’s easier to understand and work with.
	•	It’s made for structured data (tables), which is the kind of data you usually get in the real world: CSVs, spreadsheets, databases, survey results, etc.

Why Pandas is Important
	•	It’s the starting point for any data science or machine learning project
	•	It helps you explore and clean data before you can build charts or train models
	•	It’s much easier to use than raw Python or NumPy when working with tabular data


🧮 NumPy (Numerical Python)

NumPy is a Python library that focuses on numerical operations and high-performance computations using arrays.
	•	It works with n-dimensional arrays (like lists of numbers, grids, or even 3D arrays).
	•	It’s mainly used for mathematical operations, linear algebra, matrix multiplications, and scientific calculations.
	•	It’s extremely fast because it uses optimized C code underneath.
	•	Data in NumPy doesn’t have column or row labels — just index positions (like array[0][1]).
	•	It’s perfect when you need to work with large sets of numbers or mathematical modeling, such as building machine learning algorithms from scratch.
	•	However, for real-world data like customer information or CSV files, NumPy alone is not very intuitive.
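A minimal sketch of the points above, using a small made-up grid of numbers (the values and the name `grid` are illustrative assumptions, not taken from a real dataset):

```python
import numpy as np

# Hypothetical 2x3 grid of numbers, for illustration only
grid = np.array([[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0]])

# Elements are addressed by position only: row 0, column 1
first_row_second_col = grid[0][1]

# Arithmetic applies to every element at once, with no explicit loop
doubled = grid * 2

# Matrix multiplication: (2x3) @ (3x2) gives a 2x2 result
product = grid @ grid.T

print(first_row_second_col)
print(doubled)
print(product.shape)
```

Nothing in the array says what each row or column means, which is the gap Pandas fills with labels.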

⸻

🐼 Pandas (Python Data Analysis Library)

Pandas is a library built on top of NumPy that makes working with structured/tabular data much easier. Behind the scenes, NumPy serves as the engine in most cases, and Pandas can be thought of as the dashboard on top of it.
	•	It introduces two main data structures: Series (1D labeled array) and DataFrame (2D labeled table).
	•	With Pandas, you can easily read data from files like CSVs, Excel sheets, and databases.
	•	It allows you to filter, sort, group, and clean data using human-friendly row and column labels.
	•	It’s designed to make real-world data handling simple and efficient — especially when data is messy or unstructured.
	•	While it’s not as fast at pure number crunching as NumPy, it makes data analysis workflows far more convenient.
	•	In most data science projects, Pandas is used first to clean and explore the data, and then NumPy may be used underneath for the calculations.
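As a rough illustration of that workflow, the sketch below reads a tiny CSV, filters it, and sorts it. The CSV text, the column names, and the variable names are hypothetical; the data is held in memory with io.StringIO so no file on disk is assumed.

```python
import io
import pandas as pd

# Hypothetical CSV content held in memory so the example needs no file on disk
csv_text = """city,temperature
Oslo,4.5
Cairo,29.0
Lima,19.5
"""

cities = pd.read_csv(io.StringIO(csv_text))

# Filter rows by a condition on a named column, then sort by that column
warm = cities[cities["temperature"] > 10].sort_values("temperature")

print(warm)
```

With a real file, the same call would typically take a path instead of the StringIO object.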


Pandas is built using NumPy (for data handling and speed), Cython (for performance), and Python (for structure), and it wraps them into a simple, powerful toolkit for working with structured data.

In [244]:
import pandas as pd 
import numpy as np
#You do not need to import NumPy just because Pandas is built on top of it.
#You only import NumPy when you plan to use NumPy directly in your own code.

We analyze the Group of Seven (G7), a political forum formed by Canada, France, Germany, Italy, Japan, the United Kingdom, and the United States. We start with the population of each country, stored in a pandas.Series object.

A data structure is a way of storing data in a computer so that it can be used efficiently.
Pandas has two types of data structures. 
1. pandas.Series
2. pandas.DataFrame

Think of data structures like containers:
	•	A Series is like a single column of labeled data (e.g., a list of countries with their populations).
	•	A DataFrame is like an Excel sheet — a table with rows and columns.

In [245]:
# Population in Millions
g7_population = pd.Series([35.467, 63.941, 80.940, 60.665, 127.061, 64.511, 318.523])

pd.Series() is the constructor provided by the Pandas library; it stores the data in a Series object, which is a one-dimensional labeled array. It is like a dictionary or an Excel column.

In [246]:
g7_population

0     35.467
1     63.941
2     80.940
3     60.665
4    127.061
5     64.511
6    318.523
dtype: float64

In [247]:
g7_population.name = "Group Seven Population in Millions"

In [248]:
g7_population
#We can add a name to the pandas series

0     35.467
1     63.941
2     80.940
3     60.665
4    127.061
5     64.511
6    318.523
Name: Group Seven Population in Millions, dtype: float64

Series are pretty similar to numpy arrays.

In [249]:
g7_population.dtype

dtype('float64')

In [250]:
g7_population.values
# This shows the underlying data in a Series

array([ 35.467,  63.941,  80.94 ,  60.665, 127.061,  64.511, 318.523])

In [251]:
type(g7_population.values)
# The values are stored in a NumPy array, so a pandas Series is a labeled wrapper around a NumPy array

numpy.ndarray

In [252]:
#We can access its elements similar to extracting the elements in a python list
g7_population[0]

np.float64(35.467)

In [253]:
g7_population[1]

np.float64(63.941)

In [254]:
g7_population.index

RangeIndex(start=0, stop=7, step=1)

Both a pandas Series and a Python list have an index. A Series displays its index when printed, whereas a list's index is implicit and not shown. The main difference is that the index of a Series can be changed: we can assign our own labels to it.

This is one of the things that sets pandas apart. Changeable indexes let data be labeled, aligned, and analyzed in a human-friendly way, which is essential for real-world data science and analytics.
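To make "aligned" concrete, here is a small sketch with two Series whose labels appear in a different order. The figures and the names `exports` and `imports` are hypothetical.

```python
import pandas as pd

# Hypothetical figures, labeled by country; note the different label order
exports = pd.Series([10, 20, 30], index=["Canada", "France", "Japan"])
imports = pd.Series([3, 1, 2], index=["Japan", "Canada", "France"])

# Subtraction matches values by label, not by position
balance = exports - imports

print(balance)
```

With plain lists or NumPy arrays, the same subtraction would pair values purely by position.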

In [255]:
g7_population.index = [
    "Canada",
    "France",
    "Germany",
    "Italy",
    "Japan",
    "United Kingdom",
    "United States",
]

In [256]:
g7_population
# Changing the index makes the data easier to read
# From now on we refer to each value by its meaningful name instead of an index number, which makes the data easier to work with

Canada             35.467
France             63.941
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: Group Seven Population in Millions, dtype: float64

The Series above now looks like a dictionary of key-value pairs rather than a list, with added features for data analysis. It follows that we can also create a Series from a dictionary.

Key differences between a dictionary and pandas series

Label-Based Access

A Python dict allows access to values using keys. For example, my_dict["Canada"] returns the value associated with the key “Canada”.

Similarly, a pandas.Series allows access to values using custom index labels. For example, my_series["Canada"] returns the corresponding value from the Series.

⸻

Order Preservation

In Python, dictionaries preserve insertion order only from version 3.7 onwards. In earlier versions, the order of keys was not guaranteed.

Pandas Series always preserves the order of elements. When you define values and labels in a specific order, that order is maintained throughout operations unless explicitly changed.

⸻

Vectorized Operations

Dictionaries do not support vectorized operations. You cannot perform mathematical operations on all values of a dictionary at once. You would need to loop through or use comprehensions.

Pandas Series support vectorized operations. You can perform operations like addition, subtraction, filtering, or applying mathematical functions directly on the entire Series without using loops.
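A short side-by-side sketch of the difference, using made-up prices (the item names and the 1.25 multiplier are assumptions for illustration):

```python
import pandas as pd

prices = {"apple": 1.0, "bread": 2.5, "milk": 1.5}  # hypothetical values

# With a plain dict, every value has to be visited explicitly
dict_with_tax = {item: price * 1.25 for item, price in prices.items()}

# With a Series, one expression applies to all values
price_series = pd.Series(prices)
series_with_tax = price_series * 1.25

print(dict_with_tax)
print(series_with_tax)
```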

⸻

Support for Missing Values (NaN)

Dictionaries have no built-in support for representing or handling missing values. If a key is absent, accessing it raises a KeyError.

Pandas Series can contain missing values using NaN (Not a Number), which makes it useful for real-world datasets that may have incomplete or missing entries.
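A minimal sketch of how a Series treats a missing entry; the readings and labels below are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical readings with one missing entry
readings = pd.Series([10.0, np.nan, 30.0], index=["mon", "tue", "wed"])

missing_mask = readings.isna()        # True where a value is missing
mean_ignoring_nan = readings.mean()   # NaN is skipped by default
filled = readings.fillna(0.0)         # replace missing values with 0.0

print(missing_mask)
print(mean_ignoring_nan)
print(filled)
```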

⸻

Underlying Data Type

A Python dictionary is a built-in data type and is part of the core Python language. It does not rely on external libraries.

A Pandas Series is built on top of NumPy arrays. This gives it high performance and access to numerical operations that are not available in basic Python structures.


Converting dictionaries or JSON into Pandas Series is a common step in data ingestion and preparation, enabling the analyst to move from raw data to a structured format that’s easy to analyze and manipulate.

Ways to create a pandas series 

In [257]:
#1
g7_population = pd.Series({
    "Canada": 35.467,
    "France": 63.941,
    "Germany": 80.940,
    "Italy": 60.665,
    "Japan": 127.061,
    "United Kingdom": 64.511,
    "United States": 318.523
})

print(g7_population)

Canada             35.467
France             63.941
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
dtype: float64


In [258]:
#2
g7_population = pd.Series(
    [35.467, 63.941, 80.940, 60.665, 127.061, 64.511, 318.523],
    index=["Canada", "France", "Germany", "Italy", "Japan", "United Kingdom", "United States"]
)

print(g7_population)

Canada             35.467
France             63.941
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
dtype: float64


In [259]:
pd.Series(g7_population, index = ["France", "Germany", "Italy", "Spain"])
#This line is:
	# •	Creating a subset of the original Series,
	# •	Reordering the entries,
	# •	Filling NaN for any index label not present in the original (like "Spain" in this case).

France     63.941
Germany    80.940
Italy      60.665
Spain         NaN
dtype: float64

About the call above:
	•	pd.Series(original_series, index=[...]) creates a new Series.
	•	The result is not stored unless you assign it to a variable.
	•	It’s useful for selecting, reordering, or introducing missing entries in a structured way.
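The same selection can also be written with Series.reindex, which some readers find more explicit. A small sketch under the assumption of a three-country Series (the name `population` is illustrative):

```python
import pandas as pd

population = pd.Series({"France": 63.941, "Germany": 80.940, "Italy": 60.665})

labels = ["Italy", "France", "Spain"]

# Both forms build a new Series with exactly these labels, in this order
via_constructor = pd.Series(population, index=labels)
via_reindex = population.reindex(labels)

print(via_reindex)
print(via_constructor.equals(via_reindex))
```

"Spain" is not in the original data, so both results hold NaN for it.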

Indexing 

In [260]:
g7_population

Canada             35.467
France             63.941
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
dtype: float64

In [261]:
# Accessing elements of a pandas Series using the index label as the key
g7_population["Canada"]

np.float64(35.467)

In [262]:
g7_population['Japan']

np.float64(127.061)

In [263]:
g7_population[0]
# Positional access with plain [] still works here because the index holds string labels,
# but recent pandas versions emit a FutureWarning for it; .iloc[0] is the explicit form

np.float64(35.467)

In [264]:
#iloc stands for integer-location based indexing.
g7_population.iloc[0]

np.float64(35.467)

In [265]:
g7_population.iloc[0:2]

Canada    35.467
France    63.941
dtype: float64

In [266]:
g7_population.iloc[0]

np.float64(35.467)

In [267]:
g7_population.iloc[-1]

np.float64(318.523)

In [268]:
data = {
    "Country": ["Canada", "France", "Japan"],
    "Population": [35.5, 66.9, 127.3]
}

df = pd.DataFrame(data)

print(df.iloc[1])    
print(df.iloc[1, 0])    
print(df.iloc[:, 1])  

Country       France
Population      66.9
Name: 1, dtype: object
France
0     35.5
1     66.9
2    127.3
Name: Population, dtype: float64


	•	df.iloc[1] → Returns the entire second row of the DataFrame as a Series.
	•	df.iloc[1, 0] → Returns the value in the second row and first column (i.e., “France”).
	•	df.iloc[:, 1] → Returns the entire second column (Population) as a Series. The ":" means all rows, from the first to the last.

In [269]:
g7_population["Canada": "Japan"]
# Returns the data from Canada through Japan; unlike list slicing, a label slice on a Series includes the end label

Canada      35.467
France      63.941
Germany     80.940
Italy       60.665
Japan      127.061
dtype: float64
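The inclusive end of a label slice is easy to miss, so here is a small comparison with a positional slice. The three-element Series `s` is an illustrative subset of the data above.

```python
import pandas as pd

s = pd.Series([35.467, 63.941, 80.940], index=["Canada", "France", "Germany"])

by_label = s.loc["Canada":"France"]   # label slice: end label is included
by_position = s.iloc[0:1]             # positional slice: end position is excluded

print(by_label)
print(by_position)
```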

Conditional Selection Using Booleans
It is very similar to boolean selection in NumPy.

In [270]:
g7_population

Canada             35.467
France             63.941
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
dtype: float64

In [271]:
g7_population > 70
# Comparing the Series with a value does not filter anything by itself
# Instead, it returns a boolean Series with True where the condition holds and False elsewhere

Canada            False
France            False
Germany            True
Italy             False
Japan              True
United Kingdom    False
United States      True
dtype: bool

In [272]:
g7_population[g7_population > 70]
# Here, instead of booleans, only the elements that satisfy the condition are returned
# This is indexing with a boolean condition

Germany           80.940
Japan            127.061
United States    318.523
dtype: float64

In [273]:
g7_population.mean()

np.float64(107.30114285714286)

In [274]:
g7_population[g7_population > g7_population.mean()]   
# This returns the entries whose population is greater than the mean

Japan            127.061
United States    318.523
dtype: float64

In [275]:
g7_population.std()
#Std stands for the standard deviation

97.25071289957825

Standard deviation is a measure of how spread out the data is from the average (mean). If the standard deviation is low, it means most of the data points are close to the mean, indicating the data is consistent and stable. On the other hand, if the standard deviation is high, it means the data points are widely spread from the mean, showing inconsistency or variability in the data.

In data science and machine learning, standard deviation is very useful. For example, if you run the same model multiple times and the accuracy varies a lot, the standard deviation of the accuracy will be high. This means the model’s performance is unstable and unreliable. But if the accuracy stays close to the same value each time, the standard deviation will be low, indicating the model is performing consistently and is more trustworthy.
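One detail worth knowing is that pandas computes the sample standard deviation (dividing by n - 1) by default, while NumPy divides by n. A sketch with a small made-up set of scores:

```python
import numpy as np
import pandas as pd

scores = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # hypothetical values

sample_std = scores.std()                    # pandas default: ddof=1
population_std = np.std(scores.to_numpy())   # NumPy default: ddof=0

# Sample standard deviation written out step by step
mean = scores.mean()
manual_std = (((scores - mean) ** 2).sum() / (len(scores) - 1)) ** 0.5

print(sample_std, population_std, manual_std)
```

The two conventions give similar results on large datasets, but they can differ noticeably on small ones such as the seven G7 values.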

In [276]:
g7_population[(g7_population > g7_population.mean() - g7_population.std() /2 )| (g7_population  > g7_population.mean() + g7_population.std() / 2 )]


France             63.941
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
dtype: float64

The expression above combines two conditions with | (or): population greater than (mean - half the standard deviation), or population greater than (mean + half the standard deviation). Any value that passes the second condition also passes the first, so the result is simply every population above the lower threshold (mean - half the standard deviation).
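If the goal were instead to keep only the values that lie within half a standard deviation of the mean, the two comparisons would be combined with & and the second one flipped to "less than". This is an assumption about a possible intent, not what the cell above does; the names `lower`, `upper`, and `within_band` are illustrative.

```python
import pandas as pd

g7 = pd.Series({
    "Canada": 35.467, "France": 63.941, "Germany": 80.940, "Italy": 60.665,
    "Japan": 127.061, "United Kingdom": 64.511, "United States": 318.523,
})

lower = g7.mean() - g7.std() / 2
upper = g7.mean() + g7.std() / 2

# Keep only values strictly between the two thresholds
within_band = g7[(g7 > lower) & (g7 < upper)]

print(within_band)
```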

In [277]:
g7_population * 1_000_000
# The "_" in the number is only there to make large numbers easier to read

Canada             35467000.0
France             63941000.0
Germany            80940000.0
Italy              60665000.0
Japan             127061000.0
United Kingdom     64511000.0
United States     318523000.0
dtype: float64

In [278]:
np.log(g7_population)

Canada            3.568603
France            4.157961
Germany           4.393708
Italy             4.105367
Japan             4.844667
United Kingdom    4.166836
United States     5.763695
dtype: float64

A log transform is commonly used to reduce the skewness of data and bring it closer to a normal distribution.

Skewness refers to the asymmetry in a dataset’s distribution. When data is right-skewed (positively skewed), it has a long tail on the right, meaning a few large values pull the average higher—common in things like income data. In left-skewed (negatively skewed) data, the tail is on the left, and a few low values pull the average lower. A dataset with no skew is symmetrical, meaning its values are evenly spread around the average.

A normal distribution, also known as a bell curve or Gaussian distribution, is a special type of symmetrical data distribution where most values cluster around the mean, and fewer values appear as you move away from it. In this distribution, the mean, median, and mode are all the same. Many natural phenomena, like height, weight, and standardized test scores, tend to follow this pattern.

These concepts are widely used in statistics, machine learning, economics, and healthcare. Many algorithms and statistical tests assume that the data is normally distributed. When data is highly skewed, transformations like logarithms are used to reduce skewness and bring the data closer to a normal distribution, which helps improve model accuracy and interpretability.

If 80% of your sales are high and only 20% are low, and those few low sales are much lower than the rest, your data becomes left-skewed (not right-skewed). Why? Because the long tail of the data is on the left side — those very low sales pull the distribution in that direction.

Conversely, if most of your sales are low, but a few sales are extremely high, it becomes right-skewed, with a long tail on the right (the high sales stretch the curve).

A normal distribution would mean that most sales are around the average, with low and high sales balanced symmetrically around the mean, so roughly half of the sales fall on each side.

What about Standard Deviation?

Standard deviation does not measure skewness, but instead measures spread — how far your data points are from the mean. In a normal distribution, standard deviation tells you things like:
	•	~68% of values lie within ±1 standard deviation from the mean
	•	~95% within ±2
	•	~99.7% within ±3

In skewed data, standard deviation can be misleading, because the mean itself is pulled by outliers. That’s why we often transform skewed data (e.g., with a log transform) to bring it closer to normal, making standard deviation and other statistical tools more meaningful.

The tail is the key to skewness:
	•	If the tail is on the right side (towards the higher values), the data is right-skewed (positively skewed).
	•	This means most data points are low or moderate, but there are some very high values pulling the tail out to the right.
	•	If the tail is on the left side (towards the lower values), the data is left-skewed (negatively skewed).
	•	This means most data points are high or moderate, but there are some very low values pulling the tail out to the left.



The extreme outliers—those data points that are far away from the average—play the biggest role in determining skewness.
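Skewness can be measured directly with Series.skew(). The sketch below uses made-up data in which each value is double the previous one, chosen because its logarithms are evenly spaced and therefore symmetric; real data will rarely be this tidy.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed data: each value is double the previous one
raw = pd.Series([1, 2, 4, 8, 16, 32, 64, 128, 256], dtype="float64")

raw_skew = raw.skew()          # clearly positive: long tail on the right
log_skew = np.log(raw).skew()  # evenly spaced after the log, so roughly zero

print(raw_skew)
print(log_skew)
```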

In [279]:
g7_population["Canada":"Japan"].mean()
# Slicing by label, then taking the mean of the selected values

np.float64(73.6148)

Boolean Arrays

In [280]:
g7_population


Canada             35.467
France             63.941
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
dtype: float64

In [281]:
g7_population > 80

Canada            False
France            False
Germany            True
Italy             False
Japan              True
United Kingdom    False
United States      True
dtype: bool

In [282]:
g7_population[g7_population > 80]

Germany           80.940
Japan            127.061
United States    318.523
dtype: float64

In [283]:
g7_population[(g7_population > 80) | (g7_population <40)]
# Entries that satisfy at least one of the conditions are selected

Canada            35.467
Germany           80.940
Japan            127.061
United States    318.523
dtype: float64

In [284]:
g7_population[(g7_population > 80) & (g7_population < 40)]
# Entries that satisfy both conditions are selected; no value is both above 80 and below 40, so the result is empty


Series([], dtype: float64)

Modifying a series

In [285]:
g7_population["Canada"] = 100

In [286]:
g7_population

Canada            100.000
France             63.941
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
dtype: float64

In [287]:
g7_population.iloc[-1] = 200
# Using iloc to access the last element by position

In [288]:
g7_population

Canada            100.000
France             63.941
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     200.000
dtype: float64

In [289]:
g7_population[(g7_population <= 100)] = 20
# Every value that satisfies the condition is replaced with 20

In [290]:
g7_population

Canada             20.000
France             20.000
Germany            20.000
Italy              20.000
Japan             127.061
United Kingdom     20.000
United States     200.000
dtype: float64

All of the procedures above are important in data cleaning.
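Assignments like these change the Series in place. When the original values still matter, one option is to work on a copy. A sketch, assuming a three-country subset of the data (the names `original` and `cleaned` are illustrative):

```python
import pandas as pd

original = pd.Series({"Canada": 35.467, "Japan": 127.061, "United States": 318.523})

# Work on a copy so the original values stay available
cleaned = original.copy()
cleaned.loc[cleaned <= 100] = 20

print(original)
print(cleaned)
```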

DataFrames

A DataFrame in pandas is a two-dimensional, tabular data structure that resembles a table or spreadsheet, where data is organized in rows and columns. Each column in a DataFrame can hold different types of data (e.g., integers, floats, strings), and each column is essentially a Series, which is a one-dimensional labeled array. The key difference is that a Series represents a single column of data with an associated index, while a DataFrame is a collection of multiple Series sharing a common index, allowing for more complex data manipulation and analysis. In simpler terms, a Series is like a single column, and a DataFrame is like a full table made up of multiple such columns.
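The "collection of Series sharing a common index" idea can be sketched by building a DataFrame from two Series. The area figures are rough, hypothetical values used only for illustration.

```python
import pandas as pd

population = pd.Series({"Canada": 35.467, "France": 63.941})
area = pd.Series({"France": 0.64, "Canada": 9.98})  # hypothetical, million km^2

# The two Series are matched by label, even though their order differs
table = pd.DataFrame({"Population": population, "Area": area})

# Each column of the table is itself a Series that shares the row index
population_column = table["Population"]

print(table)
print(type(population_column))
```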

In [291]:
import pandas as pd

g7_data = {
    'Population': [37742154, 65273511, 83783942, 60461826, 126476461, 67886011, 331002651],
    'GDP': [1736.4, 2715.5, 3845.6, 2001.2, 4971.8, 2829.1, 21433.2],  # in billions USD
    'Continent': ['North America', 'Europe', 'Europe', 'Europe', 'Asia', 'Europe', 'North America'],
    'HDI': [0.936, 0.901, 0.942, 0.895, 0.925, 0.929, 0.921]  # Approx HDI values
}


In [292]:
df = pd.DataFrame(g7_data)
# The data above is a dictionary, so we use pandas to turn it into a DataFrame

In [293]:
df

   Population      GDP      Continent    HDI
0    37742154   1736.4  North America  0.936
1    65273511   2715.5         Europe  0.901
2    83783942   3845.6         Europe  0.942
3    60461826   2001.2         Europe  0.895
4   126476461   4971.8           Asia  0.925
5    67886011   2829.1         Europe  0.929
6   331002651  21433.2  North America  0.921


In [294]:
df.index = ['Canada', 'France', 'Germany', 'Italy', 'Japan', 'United Kingdom', 'United States']

In [295]:
df
# Here now the country names are being used as the index 

                Population      GDP      Continent    HDI
Canada            37742154   1736.4  North America  0.936
France            65273511   2715.5         Europe  0.901
Germany           83783942   3845.6         Europe  0.942
Italy             60461826   2001.2         Europe  0.895
Japan            126476461   4971.8           Asia  0.925
United Kingdom    67886011   2829.1         Europe  0.929
United States    331002651  21433.2  North America  0.921


In [296]:
df.columns    

Index(['Population', 'GDP', 'Continent', 'HDI'], dtype='object')

In [297]:
df.index

Index(['Canada', 'France', 'Germany', 'Italy', 'Japan', 'United Kingdom',
       'United States'],
      dtype='object')

In [298]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7 entries, Canada to United States
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Population  7 non-null      int64  
 1   GDP         7 non-null      float64
 2   Continent   7 non-null      object 
 3   HDI         7 non-null      float64
dtypes: float64(2), int64(1), object(1)
memory usage: 280.0+ bytes


In [299]:
df.size

28

In [300]:
df.shape

(7, 4)

In [301]:
df.describe()

        Population          GDP       HDI
count          7.0          7.0       7.0
mean   110375200.0  5647.542857  0.921286
std    101035700.0  7047.741341  0.017423
min     37742150.0       1736.4     0.895
25%     62867670.0      2358.35     0.911
50%     67886010.0       2829.1     0.925
75%    105130200.0       4408.7    0.9325
max    331002700.0      21433.2     0.942


In [302]:
df.dtypes

Population      int64
GDP           float64
Continent      object
HDI           float64
dtype: object

In [303]:
df.dtypes.value_counts()

float64    2
int64      1
object     1
Name: count, dtype: int64

In Pandas, .columns returns the column labels of the DataFrame as an Index.
.index returns the row labels (a range or explicit labels).
.info() provides a concise summary including column data types, non-null counts, and memory usage.
.describe() offers statistical summaries such as mean, min, max, and quartiles for numeric columns.
.size gives the total number of elements in the DataFrame (rows × columns).
.shape returns a tuple representing the number of rows and columns.
.dtypes shows the data type of each column.
.dtypes.value_counts() counts how many columns exist for each data type.


Indexing, Selection and Slicing 

In [304]:
df 

                Population      GDP      Continent    HDI
Canada            37742154   1736.4  North America  0.936
France            65273511   2715.5         Europe  0.901
Germany           83783942   3845.6         Europe  0.942
Italy             60461826   2001.2         Europe  0.895
Japan            126476461   4971.8           Asia  0.925
United Kingdom    67886011   2829.1         Europe  0.929
United States    331002651  21433.2  North America  0.921


In [305]:
df.loc['Canada']
# This selects the entire row using the index label

Population         37742154
GDP                  1736.4
Continent     North America
HDI                   0.936
Name: Canada, dtype: object

In [306]:
df.iloc[0]
# Whereas iloc selects the row by its position

Population         37742154
GDP                  1736.4
Continent     North America
HDI                   0.936
Name: Canada, dtype: object

In [307]:
df['Continent']
# Whereas this accesses the given column

Canada            North America
France                   Europe
Germany                  Europe
Italy                    Europe
Japan                      Asia
United Kingdom           Europe
United States     North America
Name: Continent, dtype: object

Each of the results above is a Series.
Confusion can arise when a DataFrame uses a numerical index; here the country names are used as the index. In either case, loc always selects by label and iloc always selects by position, no matter what kind of index the DataFrame uses.


In [308]:
df["Population"].to_frame()
# .to_frame() converts the Series into a DataFrame, so the data can be passed as a 2D structure to a function that expects one


                Population
Canada            37742154
France            65273511
Germany           83783942
Italy             60461826
Japan            126476461
United Kingdom    67886011
United States    331002651


In [309]:
df["Population"]

Canada             37742154
France             65273511
Germany            83783942
Italy              60461826
Japan             126476461
United Kingdom     67886011
United States     331002651
Name: Population, dtype: int64

The difference is that the first result is a DataFrame and the second is a Series. The DataFrame has a column header (Population) in addition to the country names as row labels, which makes it two-dimensional. The shape sometimes matters for later operations, so a Series may need to be converted to a DataFrame using .to_frame().
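The difference shows up clearly in the shape of each result. A small sketch with a two-row table (a subset of the data above; the variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"Population": [37742154, 65273511]}, index=["Canada", "France"])

as_series = df["Population"]             # single brackets: a Series
as_frame = df["Population"].to_frame()   # converted: a one-column DataFrame
also_frame = df[["Population"]]          # list with one column name: a DataFrame

print(as_series.shape)
print(as_frame.shape)
print(also_frame.shape)
```

Selecting with a one-element list is a common alternative to .to_frame() when a DataFrame is needed from the start.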

In [310]:
df[["Population" , "GDP"]]
# df[...] selects columns from a DataFrame; to select multiple columns, pass a list of column names inside the brackets

                Population      GDP
Canada            37742154   1736.4
France            65273511   2715.5
Germany           83783942   3845.6
Italy             60461826   2001.2
Japan            126476461   4971.8
United Kingdom    67886011   2829.1
United States    331002651  21433.2


In [311]:
df.loc['Canada': 'Japan']
# Selecting rows using a label slice

         Population     GDP      Continent    HDI
Canada     37742154  1736.4  North America  0.936
France     65273511  2715.5         Europe  0.901
Germany    83783942  3845.6         Europe  0.942
Italy      60461826  2001.2         Europe  0.895
Japan     126476461  4971.8           Asia  0.925


In [312]:
df.iloc[1:3,-1]
# Selecting by position: rows 1 and 2 of the last column

France     0.901
Germany    0.942
Name: HDI, dtype: float64

In [313]:
df.loc["Italy"]

Population    60461826
GDP             2001.2
Continent       Europe
HDI              0.895
Name: Italy, dtype: object

In [314]:
df.loc["Canada" : "Germany", "Population"]


Canada     37742154
France     65273511
Germany    83783942
Name: Population, dtype: int64

In [315]:
df.loc["Canada" : "Germany", "Population"].to_frame()

         Population
Canada     37742154
France     65273511
Germany    83783942


One thing to understand is how this differs from ordinary slicing, where start:end:step are the only parts. With loc and iloc on a DataFrame, the value after the comma is not a step; it is a column selector that restricts the result to the given column. loc and iloc are indexers provided by pandas. A Series is one-dimensional and has no columns, so nothing is passed after a comma there. We can also pass a list of column names to extract several columns at once.
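A short sketch of the two roles, using a four-row subset of the table (the variable names are illustrative). The part before the comma selects rows, the part after it selects columns, and a step can still appear inside the row slice itself.

```python
import pandas as pd

df = pd.DataFrame(
    {"Population": [37742154, 65273511, 83783942, 60461826],
     "GDP": [1736.4, 2715.5, 3845.6, 2001.2]},
    index=["Canada", "France", "Germany", "Italy"],
)

# Before the comma: which rows. After the comma: which columns.
subset = df.loc["Canada":"Germany", "GDP"]

# A step is still allowed inside the row slice itself
every_other = df.loc["Canada":"Italy":2, "GDP"]

print(subset)
print(every_other)
```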

In [316]:
df.loc["Canada" : "Japan", ["Population" , "GDP"]]
# Since there are two columns, their names go inside a list

         Population     GDP
Canada     37742154  1736.4
France     65273511  2715.5
Germany    83783942  3845.6
Italy      60461826  2001.2
Japan     126476461  4971.8


iloc works with the numeric position of the index


In [317]:
df.iloc[1]

Population    65273511
GDP             2715.5
Continent       Europe
HDI              0.901
Name: France, dtype: object

In [318]:
df.iloc[:,1]
# This gives all the rows of the column at position 1

Canada             1736.4
France             2715.5
Germany            3845.6
Italy              2001.2
Japan              4971.8
United Kingdom     2829.1
United States     21433.2
Name: GDP, dtype: float64

In [319]:
df.iloc[0:2, [1,2,3]]
#Selecting multiple columns of a filtered row using the numeric indexing 


           GDP      Continent    HDI
Canada  1736.4  North America  0.936
France  2715.5         Europe  0.901


When your DataFrame uses numbers as index values, such as 0, 1, 2, and so on, it can become unclear whether you’re referring to the index label (like index 0) or the row’s position (the first row). This can lead to confusion or errors in your code. To avoid this ambiguity, it’s best to use .iloc[] when you want to select rows by their position (for example, the first, second, or third row), and .loc[] when you want to select rows by their index label (for example, the row that has an index of 0, even if it’s not the first row). This makes your code clearer and helps prevent mistakes.
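The ambiguity is easiest to see when the integer labels are not in positional order. A minimal sketch with a made-up Series (the same rule applies to DataFrame rows):

```python
import pandas as pd

# Hypothetical Series whose integer labels are not in positional order
s = pd.Series(["a", "b", "c"], index=[2, 0, 1])

by_label = s.loc[0]      # the entry labeled 0
by_position = s.iloc[0]  # the first entry

print(by_label)
print(by_position)
```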

Conditional Selection (Boolean Arrays)

In [320]:
df

                Population      GDP      Continent    HDI
Canada            37742154   1736.4  North America  0.936
France            65273511   2715.5         Europe  0.901
Germany           83783942   3845.6         Europe  0.942
Italy             60461826   2001.2         Europe  0.895
Japan            126476461   4971.8           Asia  0.925
United Kingdom    67886011   2829.1         Europe  0.929
United States    331002651  21433.2  North America  0.921


In [321]:
df["Population"] > 50

Canada            True
France            True
Germany           True
Italy             True
Japan             True
United Kingdom    True
United States     True
Name: Population, dtype: bool

In [322]:
df

                Population      GDP      Continent    HDI
Canada            37742154   1736.4  North America  0.936
France            65273511   2715.5         Europe  0.901
Germany           83783942   3845.6         Europe  0.942
Italy             60461826   2001.2         Europe  0.895
Japan            126476461   4971.8           Asia  0.925
United Kingdom    67886011   2829.1         Europe  0.929
United States    331002651  21433.2  North America  0.921


In [335]:
df.loc[df["Population"] < 100000000]

                Population     GDP      Continent    HDI
Canada            37742154  1736.4  North America  0.936
France            65273511  2715.5         Europe  0.901
Germany           83783942  3845.6         Europe  0.942
Italy             60461826  2001.2         Europe  0.895
United Kingdom    67886011  2829.1         Europe  0.929


In [None]:
df.loc[df["Population"] > 100000000 , ["Population","GDP"]]
# Keep only the rows that satisfy the condition, and from those rows only the Population and GDP columns

Unnamed: 0,Population,GDP
Japan,126476461,4971.8
United States,331002651,21433.2
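
The comparison inside .loc[] produces a boolean Series, so it can be stored in a variable and combined with other conditions. The sketch below uses three rows copied from the table above; the names df_demo, is_large, and not_asia are made up for illustration.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {
        "Population": [37742154, 83783942, 126476461],
        "Continent": ["North America", "Europe", "Asia"],
    },
    index=["Canada", "Germany", "Japan"],
)

is_large = df_demo["Population"] > 50_000_000   # boolean Series, one value per row
not_asia = df_demo["Continent"] != "Asia"       # another boolean Series

# & means "and", | means "or"; when writing conditions inline,
# wrap each one in its own parentheses
selected = df_demo.loc[is_large & not_asia]

print(selected.index.tolist())   # ['Germany']
```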


Dropping rows and columns using Pandas 
WHY to use .drop() in pandas:
	1.	To remove irrelevant columns that do not contribute to the analysis or model training.
	2.	To eliminate duplicate or redundant information that could cause noise in the dataset.
	3.	To simplify the dataset for better readability or visualization.
	4.	To remove sensitive or personal data for privacy or compliance purposes.
	5.	To prepare features for modeling by dropping the target column from the feature set (or vice versa).
	6.	To remove rows with invalid or corrupt data that may affect analysis.
	7.	To filter out outliers or unwanted data points based on business rules.
	8.	To clean up merged or joined datasets where some columns may be duplicated or redundant.

WHEN to use .drop() in pandas:
	1.	When preparing data for visualization and only the necessary features are needed.
	2.	When reducing dimensionality to avoid the curse of dimensionality or improve model performance.
	3.	When data contains null or NaN values that you decide to remove instead of imputing.
	4.	When exporting the dataset and you want to exclude internal or helper columns.
	5.	When cleaning raw data before conducting statistical or machine learning tasks.
	6.	When columns have high correlation and you want to drop one to reduce multicollinearity.
	7.	When you’re performing feature engineering and need to remove outdated or unused features.
	8.	When data violates data integrity or business logic, and the rows/columns must be discarded.
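
As a rough sketch of two of the cases listed above (dropping a helper column and separating the target from the features), assuming a made-up table where row_id is a helper column and HDI is treated as the target:

```python
import pandas as pd

# Hypothetical modeling table: "row_id" is a helper column, "HDI" plays the role of the target
data = pd.DataFrame(
    {
        "row_id": [1, 2, 3],
        "Population": [37742154, 65273511, 83783942],
        "GDP": [1736.4, 2715.5, 3845.6],
        "HDI": [0.936, 0.901, 0.942],
    }
)

X = data.drop(columns=["row_id", "HDI"])   # features: everything except helper and target
y = data["HDI"]                            # target column, kept separately

print(X.columns.tolist())   # ['Population', 'GDP']
```

The original table data still contains all four columns, so nothing is lost by building X this way.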


📌 Multicollinearity — What is it?

Multicollinearity occurs when two or more independent variables (features) in a dataset are highly correlated with each other — meaning they carry similar information.

⸻

🧠 Why is multicollinearity a problem?
	1.	Reduces model interpretability: If two features are highly correlated, it’s hard to tell which one is actually influencing the target.
	2.	Unstable coefficient estimates: In models like linear regression, it can lead to inflated standard errors, so the individual coefficients become unreliable and can change a lot with small changes in the data.
	3.	Increases model complexity without adding much value.

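High correlation means two features move together in an almost linear way: when one rises, the other rises (positive correlation) or falls (negative correlation) in close step.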
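
One simple way to spot such a pair is to look at the correlation matrix. This is only a sketch: the features table is invented, and the 0.95 cutoff is an arbitrary choice for illustration, not a standard rule.

```python
import pandas as pd

# Hypothetical features: height_in is height_cm converted to inches, so the two are redundant
features = pd.DataFrame(
    {
        "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
        "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
        "age": [34, 21, 45, 29, 52],
    }
)

corr = features.corr()                    # pairwise Pearson correlation
r = corr.loc["height_cm", "height_in"]    # very close to 1.0 for this pair

# 0.95 is an arbitrary illustrative threshold
if abs(r) > 0.95:
    features = features.drop(columns="height_in")

print(features.columns.tolist())   # ['height_cm', 'age']
```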

In [339]:
df.drop("Canada")

Unnamed: 0,Population,GDP,Continent,HDI
France,65273511,2715.5,Europe,0.901
Germany,83783942,3845.6,Europe,0.942
Italy,60461826,2001.2,Europe,0.895
Japan,126476461,4971.8,Asia,0.925
United Kingdom,67886011,2829.1,Europe,0.929
United States,331002651,21433.2,North America,0.921
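
Note that .drop() returns a new DataFrame and leaves the original untouched, which is why df still contains Canada in the later cells. A small sketch with a made-up two-row table (df_demo is not the dataset from these notes):

```python
import pandas as pd

df_demo = pd.DataFrame({"GDP": [1736.4, 2715.5]}, index=["Canada", "France"])

without_canada = df_demo.drop("Canada")   # new DataFrame; df_demo is unchanged

print(df_demo.index.tolist())          # ['Canada', 'France']
print(without_canada.index.tolist())   # ['France']

# To keep the change, assign the result back to the same name
df_demo = df_demo.drop("Canada")
```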


In [341]:
df.drop(["Canada", "Japan"])
# The [] here is a Python list: to drop more than one row at once, pass the row labels together as a list

Unnamed: 0,Population,GDP,Continent,HDI
France,65273511,2715.5,Europe,0.901
Germany,83783942,3845.6,Europe,0.942
Italy,60461826,2001.2,Europe,0.895
United Kingdom,67886011,2829.1,Europe,0.929
United States,331002651,21433.2,North America,0.921


In [342]:
df.drop(columns = ["Population", "GDP"])

Unnamed: 0,Continent,HDI
Canada,North America,0.936
France,Europe,0.901
Germany,Europe,0.942
Italy,Europe,0.895
Japan,Asia,0.925
United Kingdom,Europe,0.929
United States,North America,0.921


In [344]:
df.drop(["Italy", "Canada"], axis=0)
# The axis parameter tells drop() whether the labels are rows (axis=0, the default) or columns (axis=1)

Unnamed: 0,Population,GDP,Continent,HDI
France,65273511,2715.5,Europe,0.901
Germany,83783942,3845.6,Europe,0.942
Japan,126476461,4971.8,Asia,0.925
United Kingdom,67886011,2829.1,Europe,0.929
United States,331002651,21433.2,North America,0.921


In [347]:
df.drop(["Population", "HDI"], axis=1)
# The labels you pass must exist on the chosen axis (row labels for axis=0, column names for axis=1); otherwise pandas raises a KeyError

Unnamed: 0,GDP,Continent
Canada,1736.4,North America
France,2715.5,Europe
Germany,3845.6,Europe
Italy,2001.2,Europe
Japan,4971.8,Asia
United Kingdom,2829.1,Europe
United States,21433.2,North America
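
A quick sketch of what happens when the label and the axis do not match, using a made-up two-row table (df_demo is an illustrative name):

```python
import pandas as pd

df_demo = pd.DataFrame({"GDP": [1736.4, 2715.5]}, index=["Canada", "France"])

try:
    df_demo.drop("GDP")        # axis defaults to 0, and no ROW is labeled "GDP"
    raised = False
except KeyError:
    raised = True

print(raised)   # True

# errors="ignore" skips labels that are not found instead of raising
unchanged = df_demo.drop("GDP", errors="ignore")

# With the right axis, the column is found and removed
dropped = df_demo.drop("GDP", axis=1)
```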


The axis parameter mainly removes ambiguity when a row label and a column label have the same name. In that case, pass axis to tell pandas which one to drop. If you leave it out, drop() looks for the label among the rows (axis=0).
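
To see this in action, here is a sketch with an invented table (square) whose row labels and column labels are identical:

```python
import pandas as pd

# Hypothetical table where rows and columns share the same labels
square = pd.DataFrame([[1, 2], [3, 4]], index=["a", "b"], columns=["a", "b"])

row_dropped = square.drop("a")           # default axis=0: removes ROW "a"
col_dropped = square.drop("a", axis=1)   # axis=1: removes COLUMN "a"

print(row_dropped.index.tolist())     # ['b']
print(col_dropped.columns.tolist())   # ['b']
```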

In [354]:
df.drop(["Population", "GDP"], axis='columns')
# Instead of 0 and 1, axis also accepts the strings 'index' (or 'rows') and 'columns'

Unnamed: 0,Continent,HDI
Canada,North America,0.936
France,Europe,0.901
Germany,Europe,0.942
Italy,Europe,0.895
Japan,Asia,0.925
United Kingdom,Europe,0.929
United States,North America,0.921


In [356]:
df.drop(["Canada", "Germany"], axis = "rows")

Unnamed: 0,Population,GDP,Continent,HDI
France,65273511,2715.5,Europe,0.901
Italy,60461826,2001.2,Europe,0.895
Japan,126476461,4971.8,Asia,0.925
United Kingdom,67886011,2829.1,Europe,0.929
United States,331002651,21433.2,North America,0.921


Broadcasting Operations