Pandas
Pandas is a Python tool that helps you work with data easily — like how Excel works with tables.

It lets you:
	•	Read data from files (like CSV or Excel)
	•	Look at it in a table format
	•	Clean it if it’s messy
	•	Sort, filter, and analyze it
	•	Save the clean data again

Pandas is used a lot in data analysis, data science, and machine learning because most real-world data comes in table form — with rows and columns — and Pandas makes working with that kind of data super simple and powerful.



Pandas gives you 2 main tools:
	1.	Series → a single column (like a list with labels)
	2.	DataFrame → a full table (like an Excel sheet)
	•	It works on top of NumPy, meaning it uses NumPy under the hood to do fast calculations but adds labels (column names and row numbers) so it’s easier to understand and work with.
	•	It’s made for structured data (tables), which is the kind of data you usually get in the real world: CSVs, spreadsheets, databases, survey results, etc.

Why Pandas is Important
	•	It’s the starting point for any data science or machine learning project
	•	It helps you explore and clean data before you can build charts or train models
	•	It’s much easier to use than raw Python or NumPy when working with tabular data


🧮 NumPy (Numerical Python)

NumPy is a Python library that focuses on numerical operations and high-performance computations using arrays.
	•	It works with n-dimensional arrays (like lists of numbers, grids, or even 3D arrays).
	•	It’s mainly used for mathematical operations, linear algebra, matrix multiplications, and scientific calculations.
	•	It’s extremely fast because it uses optimized C code underneath.
	•	Data in NumPy doesn’t have column or row labels — just index positions (like array[0][1]).
	•	It’s perfect when you need to work with large sets of numbers or mathematical modeling, such as building machine learning algorithms from scratch.
	•	However, for real-world data like customer information or CSV files, NumPy alone is not very intuitive.
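As a small sketch of what position-only indexing looks like in practice (the array values and variable names here are made up for illustration):

```python
import numpy as np

# A 2-D array: rows and columns are identified by position only.
grid = np.array([[1, 2, 3],
                 [4, 5, 6]])

first_row_second_col = grid[0][1]   # same element as grid[0, 1]
column_sums = grid.sum(axis=0)      # vectorized sum down each column

print(first_row_second_col)
print(column_sums)
```

Nothing in the array says what each column means; that is the gap Pandas labels fill.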

⸻

🐼 Pandas (Python Data Analysis Library)

Pandas is a library built on top of NumPy that makes working with structured/tabular data much easier. Behind the scenes, NumPy acts as the engine in most cases, while Pandas can be thought of as the dashboard you operate it from.
	•	It introduces two main data structures: Series (1D labeled array) and DataFrame (2D labeled table).
	•	With Pandas, you can easily read data from files like CSVs, Excel sheets, and databases.
	•	It allows you to filter, sort, group, and clean data using human-friendly row and column labels.
	•	It’s designed to make real-world data handling simple and efficient — especially when data is messy or unstructured.
	•	While it’s not as fast at pure number crunching as NumPy, it makes data analysis workflows far more convenient.
	•	In most data science projects, Pandas is used first to clean and explore the data, and then NumPy may be used underneath for the calculations.


Pandas is built using NumPy (for data handling and speed), Cython (for performance), and Python (for structure), and it wraps them into a simple, powerful toolkit for working with structured data.

In [1]:
import pandas as pd 
import numpy as np
#You do not need to import NumPy just because Pandas is built on top of it.
#You only import NumPy when you plan to use NumPy directly in your own code.

We will analyze the Group of Seven (G7), a political forum formed by Canada, France, Germany, Italy, Japan, the United Kingdom and the United States. We start with the population of each country, stored in a pandas.Series object.

A data structure is a way of storing data in a computer so that it can be used efficiently.
Pandas has two main data structures:
1. pandas.Series
2. pandas.DataFrame

Think of data structures like containers:
	•	A Series is like a single column of labeled data (e.g., a list of countries with their populations).
	•	A DataFrame is like an Excel sheet — a table with rows and columns.

In [2]:
# Population in Millions
g7_population = pd.Series([35.467, 63.941, 80.940, 60.665, 127.061, 64.511, 318.523])

pd.Series() is the constructor provided by the Pandas library. It stores the data in a Series object, which is a one-dimensional labeled array, similar to a dictionary or an Excel column.

In [3]:
g7_population

0     35.467
1     63.941
2     80.940
3     60.665
4    127.061
5     64.511
6    318.523
dtype: float64

In [4]:
g7_population.name = "Group Seven Population in Millions"

In [5]:
g7_population
#We can add a name to the pandas series

0     35.467
1     63.941
2     80.940
3     60.665
4    127.061
5     64.511
6    318.523
Name: Group Seven Population in Millions, dtype: float64

Series are pretty similar to numpy arrays.

In [6]:
g7_population.dtype

dtype('float64')

In [7]:
g7_population.values
#This shows the underlying data in a Series

array([ 35.467,  63.941,  80.94 ,  60.665, 127.061,  64.511, 318.523])

In [8]:
type(g7_population.values)
#This confirms that the values are stored in a NumPy array; a Series wraps that array and adds an index

numpy.ndarray

In [9]:
#We can access its elements similar to extracting the elements in a python list
g7_population[0]

np.float64(35.467)

In [10]:
g7_population[1]

np.float64(63.941)

In [11]:
g7_population.index

RangeIndex(start=0, stop=7, step=1)

Both a pandas Series and a Python list have an index. A Series displays its index when printed, while a list's index is implicit. The main difference is that the index of a Series can be changed: we can assign our own labels to it.

The changeable index is what sets Pandas apart. It lets data be labeled, aligned, and analyzed in a human-friendly way, which is essential for real-world data science and analytics.
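To make "aligned" concrete, here is a minimal sketch with two made-up Series whose labels appear in different orders (the values and names are hypothetical):

```python
import pandas as pd

# Hypothetical values, in millions, chosen only for illustration.
pop_a = pd.Series([10.0, 20.0, 30.0], index=["Canada", "France", "Japan"])
pop_b = pd.Series([3.0, 1.0, 2.0], index=["Japan", "Canada", "France"])

# Addition matches entries by label, not by position.
total = pop_a + pop_b
print(total)
```

With plain lists or NumPy arrays the same addition would pair values by position, so Canada would be added to Japan.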

In [12]:
g7_population.index = [
    "Canada",
    "France",
    "Germany",
    "Italy",
    "Japan",
    "United Kingdom",
    "United States",
]

In [13]:
g7_population
#Changing the index makes the data easier to read
#From now on we refer to entries by their meaningful labels instead of index numbers

Canada             35.467
France             63.941
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: Group Seven Population in Millions, dtype: float64

The output above looks more like a dictionary of key-value pairs than a list, with added features for data analysis. It follows that a Series can also be created from a dictionary.

Key differences between a dictionary and pandas series

Label-Based Access

A Python dict allows access to values using keys. For example, my_dict["Canada"] returns the value associated with the key “Canada”.

Similarly, a pandas.Series allows access to values using custom index labels. For example, my_series["Canada"] returns the corresponding value from the Series.

⸻

Order Preservation

In Python, dictionaries preserve insertion order only from version 3.7 onwards. In earlier versions, the order of keys was not guaranteed.

Pandas Series always preserves the order of elements. When you define values and labels in a specific order, that order is maintained throughout operations unless explicitly changed.

⸻

Vectorized Operations

Dictionaries do not support vectorized operations. You cannot perform mathematical operations on all values of a dictionary at once. You would need to loop through or use comprehensions.

Pandas Series support vectorized operations. You can perform operations like addition, subtraction, filtering, or applying mathematical functions directly on the entire Series without using loops.
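A short side-by-side sketch of the difference, using two of the population values from this notebook (the variable names are illustrative):

```python
import pandas as pd

populations = {"Canada": 35.467, "France": 63.941}

# With a dict, each value has to be visited explicitly.
doubled_dict = {country: value * 2 for country, value in populations.items()}

# With a Series, the operation applies to every element at once.
doubled_series = pd.Series(populations) * 2

print(doubled_dict)
print(doubled_series)
```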

⸻

Support for Missing Values (NaN)

Dictionaries have no built-in support for representing or handling missing values. If a key is absent, accessing it raises a KeyError.

Pandas Series can contain missing values using NaN (Not a Number), which makes it useful for real-world datasets that may have incomplete or missing entries.
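As an illustration, the sketch below builds a Series with one missing entry (the gap for "Italy" is invented for the example) and shows a few common ways to work with it:

```python
import numpy as np
import pandas as pd

# "Italy" has no recorded value in this made-up example.
s = pd.Series({"Canada": 35.467, "Italy": np.nan, "Japan": 127.061})

missing_mask = s.isna()        # True where the value is missing
mean_ignoring_nan = s.mean()   # NaN is skipped by default (skipna=True)
cleaned = s.dropna()           # new Series without the missing entry

print(missing_mask)
print(mean_ignoring_nan)
print(cleaned)
```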

⸻

Underlying Data Type

A Python dictionary is a built-in data type and is part of the core Python language. It does not rely on external libraries.

A Pandas Series is built on top of NumPy arrays. This gives it high performance and access to numerical operations that are not available in basic Python structures.


Converting dictionaries or JSON into Pandas Series is a common step in data ingestion and preparation, enabling the analyst to move from raw data to a structured format that’s easy to analyze and manipulate.
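One possible way to do this for JSON, assuming the data arrives as a flat JSON object (the string below is a stand-in for a real file or API response):

```python
import json
import pandas as pd

# A JSON string standing in for data received from a file or an API.
raw = '{"Canada": 35.467, "France": 63.941, "Germany": 80.940}'

parsed = json.loads(raw)          # plain Python dict
populations = pd.Series(parsed)   # keys become the index, values the data

print(populations)
```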

Ways to create a pandas series 

In [14]:
#1
g7_population = pd.Series({
    "Canada": 35.467,
    "France": 63.941,
    "Germany": 80.940,
    "Italy": 60.665,
    "Japan": 127.061,
    "United Kingdom": 64.511,
    "United States": 318.523
})

print(g7_population)

Canada             35.467
France             63.941
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
dtype: float64


In [15]:
#2
g7_population = pd.Series(
    [35.467, 63.941, 80.940, 60.665, 127.061, 64.511, 318.523],
    index=["Canada", "France", "Germany", "Italy", "Japan", "United Kingdom", "United States"]
)

print(g7_population)

Canada             35.467
France             63.941
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
dtype: float64


In [16]:
pd.Series(g7_population, index = ["France", "Germany", "Italy", "Spain"])
#This line is:
	# •	Creating a subset of the original Series,
	# •	Reordering the entries,
	# •	Filling NaN for any index label not present in the original (like "Spain" in this case).

France     63.941
Germany    80.940
Italy      60.665
Spain         NaN
dtype: float64

Notes on the call above:
	•	pd.Series(original_series, index=[...]) creates a new Series.
	•	The result is not stored unless you assign it to a variable.
	•	It’s useful for selecting, reordering, or introducing missing entries in a structured way.
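The same selecting and reordering can also be written with Series.reindex, which some readers find more explicit. A minimal sketch, using a small stand-in Series rather than the notebook's g7_population:

```python
import pandas as pd

g7_subset_source = pd.Series({"France": 63.941, "Germany": 80.940, "Italy": 60.665})

# reindex returns a new Series; labels missing from the source become NaN.
subset = g7_subset_source.reindex(["Italy", "France", "Spain"])

print(subset)
```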

Indexing 

In [17]:
g7_population

Canada             35.467
France             63.941
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
dtype: float64

In [18]:
#Accessing elements of a pandas Series by using the index label as the key
g7_population["Canada"]

np.float64(35.467)

In [19]:
g7_population['Japan']

np.float64(127.061)

In [20]:
g7_population[0]
#With a label index, g7_population[0] falls back to positional access
#Recent pandas versions deprecate this and emit a FutureWarning; prefer g7_population.iloc[0]

np.float64(35.467)

In [21]:
#iloc stands for integer-location based indexing.
g7_population.iloc[0]

np.float64(35.467)

In [22]:
g7_population.iloc[0:2]

Canada    35.467
France    63.941
dtype: float64

In [23]:
g7_population.iloc[0]

np.float64(35.467)

In [24]:
g7_population.iloc[-1]

np.float64(318.523)

In [25]:
data = {
    "Country": ["Canada", "France", "Japan"],
    "Population": [35.5, 66.9, 127.3]
}

df = pd.DataFrame(data)

print(df.iloc[1])    
print(df.iloc[1, 0])    
print(df.iloc[:, 1])  

Country       France
Population      66.9
Name: 1, dtype: object
France
0     35.5
1     66.9
2    127.3
Name: Population, dtype: float64


	•	df.iloc[1] → Returns the entire second row of the DataFrame as a Series.
	•	df.iloc[1, 0] → Returns the value in the second row and first column (i.e., “France”).
	•	df.iloc[:, 1] → Returns the entire second column (Population) as a Series. The ":" means all rows, from the first to the last.

In [31]:
g7_population["Canada": "Japan"]
# Returns the entries from Canada through Japan; unlike list slicing, a label slice on a Series includes the end label

Canada      35.467
France      63.941
Germany     80.940
Italy       60.665
Japan      127.061
dtype: float64
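The contrast between label slices and positional slices can be sketched as follows (a three-element stand-in Series is used to keep the example short):

```python
import pandas as pd

s = pd.Series([35.467, 63.941, 80.940], index=["Canada", "France", "Germany"])

by_label = s.loc["Canada":"France"]   # label slice: includes "France"
by_position = s.iloc[0:1]             # positional slice: excludes position 1

print(len(by_label))
print(len(by_position))
```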

Conditional Selection using booleans
It’s very similar to Boolean indexing in NumPy.

In [33]:
g7_population

Canada             35.467
France             63.941
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
dtype: float64

In [None]:
g7_population > 70
#Comparing the Series with a value does not filter it yet
#Instead it returns a Series of booleans, True where the condition is met

Canada            False
France            False
Germany            True
Italy             False
Japan              True
United Kingdom    False
United States      True
dtype: bool

In [40]:
g7_population[g7_population > 70]
#Here, instead of booleans, only the elements that satisfy the condition are returned
#This is indexing with a boolean condition

Germany           80.940
Japan            127.061
United States    318.523
dtype: float64

In [None]:
g7_population.mean()

np.float64(107.30114285714286)

In [44]:
g7_population[g7_population > g7_population.mean()]   
#Selects the entries whose population is greater than the mean

Japan            127.061
United States    318.523
dtype: float64

In [None]:
g7_population.std()
#Std stands for the standard deviation

97.25071289957825

Standard deviation is a measure of how spread out the data is from the average (mean). If the standard deviation is low, it means most of the data points are close to the mean, indicating the data is consistent and stable. On the other hand, if the standard deviation is high, it means the data points are widely spread from the mean, showing inconsistency or variability in the data.

In data science and machine learning, standard deviation is very useful. For example, if you run the same model multiple times and the accuracy varies a lot, the standard deviation of the accuracy will be high. This means the model’s performance is unstable and unreliable. But if the accuracy stays close to the same value each time, the standard deviation will be low, indicating the model is performing consistently and is more trustworthy.
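To show what the number means, the sketch below computes the standard deviation by hand on a small made-up dataset. It also illustrates a detail worth knowing: pandas divides by n - 1 by default (sample standard deviation), while NumPy divides by n.

```python
import numpy as np
import pandas as pd

values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = values.mean()
squared_deviations = (values - mean) ** 2

# Sample standard deviation: divide by n - 1 (pandas default, ddof=1).
sample_std = np.sqrt(squared_deviations.sum() / (len(values) - 1))

# Population standard deviation: divide by n (NumPy default, ddof=0).
population_std = np.sqrt(squared_deviations.sum() / len(values))

print(sample_std, values.std())
print(population_std, np.std(values.to_numpy()))
```

Passing ddof=0 to Series.std() gives the population version if that is what an analysis calls for.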

In [48]:
g7_population[(g7_population > g7_population.mean() - g7_population.std() / 2) | (g7_population > g7_population.mean() + g7_population.std() / 2)]


France             63.941
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
dtype: float64

The expression above selects entries whose population is greater than (mean - half of the standard deviation) or greater than (mean + half of the standard deviation). Because any value above the higher threshold is also above the lower one, the second condition is redundant: the result is simply every entry above (mean - half of the standard deviation), which is about 58.68 here.
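If the goal were instead to separate values inside and outside a band around the mean, the comparisons could be written as below. This is only a sketch under that assumed intent; the names lower, upper, outside_band and inside_band are not part of the notebook.

```python
import pandas as pd

g7 = pd.Series({
    "Canada": 35.467, "France": 63.941, "Germany": 80.940, "Italy": 60.665,
    "Japan": 127.061, "United Kingdom": 64.511, "United States": 318.523,
})

lower = g7.mean() - g7.std() / 2
upper = g7.mean() + g7.std() / 2

# Assumed intent: keep only values that fall outside the band.
outside_band = g7[(g7 < lower) | (g7 > upper)]

# Complementary selection: values inside the band.
inside_band = g7[(g7 >= lower) & (g7 <= upper)]

print(outside_band)
print(inside_band)
```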

In [50]:
g7_population * 1_000_000
#The underscores in 1_000_000 only make the large number easier to read

Canada             35467000.0
France             63941000.0
Germany            80940000.0
Italy              60665000.0
Japan             127061000.0
United Kingdom     64511000.0
United States     318523000.0
dtype: float64

In [51]:
np.log(g7_population)

Canada            3.568603
France            4.157961
Germany           4.393708
Italy             4.105367
Japan             4.844667
United Kingdom    4.166836
United States     5.763695
dtype: float64

Taking the logarithm compresses large values, which reduces the skewness of the data and brings it closer to a normal distribution.

Skewness refers to the asymmetry in a dataset’s distribution. When data is right-skewed (positively skewed), it has a long tail on the right, meaning a few large values pull the average higher—common in things like income data. In left-skewed (negatively skewed) data, the tail is on the left, and a few low values pull the average lower. A dataset with no skew is symmetrical, meaning its values are evenly spread around the average.

A normal distribution, also known as a bell curve or Gaussian distribution, is a special type of symmetrical data distribution where most values cluster around the mean, and fewer values appear as you move away from it. In this distribution, the mean, median, and mode are all the same. Many natural phenomena, like height, weight, and standardized test scores, tend to follow this pattern.

These concepts are widely used in statistics, machine learning, economics, and healthcare. Many algorithms and statistical tests assume that the data is normally distributed. When data is highly skewed, transformations like logarithms are used to reduce skewness and bring the data closer to a normal distribution, which helps improve model accuracy and interpretability.
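The effect of the log transform on skewness can be checked directly with Series.skew(). A minimal sketch using the G7 population values from this notebook:

```python
import numpy as np
import pandas as pd

g7 = pd.Series([35.467, 63.941, 80.940, 60.665, 127.061, 64.511, 318.523])

skew_before = g7.skew()          # positive: long tail on the right
skew_after = np.log(g7).skew()   # still positive, but smaller

print(skew_before, skew_after)
```

With only seven values this is an illustration of the mechanism rather than a meaningful statistical result.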

If 80% of your sales are high and only 20% are low, and those few low sales are much lower than the rest, your data becomes left-skewed (not right-skewed). Why? Because the long tail of the data is on the left side — those very low sales pull the distribution in that direction.

Conversely, if most of your sales are low, but a few sales are extremely high, it becomes right-skewed, with a long tail on the right (the high sales stretch the curve).

A normal distribution would mean that most sales are close to the average, with low and high sales balanced symmetrically around the mean: roughly half of the values fall below it and half above.

What about Standard Deviation?

Standard deviation does not measure skewness, but instead measures spread — how far your data points are from the mean. In a normal distribution, standard deviation tells you things like:
	•	~68% of values lie within ±1 standard deviation from the mean
	•	~95% within ±2
	•	~99.7% within ±3

In skewed data, standard deviation can be misleading, because the mean itself is pulled by outliers. That’s why we often transform skewed data (e.g., with a log transform) to bring it closer to normal, making standard deviation and other statistical tools more meaningful.
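The 68/95/99.7 percentages can be checked empirically. The sketch below draws simulated values from a normal distribution (the seed and sample size are arbitrary choices) and measures the share within 1, 2 and 3 standard deviations:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
samples = rng.normal(loc=0.0, scale=1.0, size=100_000)

mean, std = samples.mean(), samples.std()

def fraction_within(k):
    """Share of samples within k standard deviations of the mean."""
    return np.mean(np.abs(samples - mean) <= k * std)

within_1, within_2, within_3 = (fraction_within(k) for k in (1, 2, 3))
print(within_1, within_2, within_3)
```

The printed fractions should land close to 0.68, 0.95 and 0.997; running the same check on strongly skewed data would give noticeably different shares.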

The tail is the key to skewness:
	•	If the tail is on the right side (towards the higher values), the data is right-skewed (positively skewed).
	•	This means most data points are low or moderate, but there are some very high values pulling the tail out to the right.
	•	If the tail is on the left side (towards the lower values), the data is left-skewed (negatively skewed).
	•	This means most data points are high or moderate, but there are some very low values pulling the tail out to the left.



The extreme outliers—those data points that are far away from the average—play the biggest role in determining skewness.

In [None]:
g7_population["Canada":"Japan"].mean()
#Slicing by label and then computing the mean of the selected entries

np.float64(73.6148)

Boolean Arrays

In [55]:
g7_population


Canada             35.467
France             63.941
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
dtype: float64

In [56]:
g7_population > 80

Canada            False
France            False
Germany            True
Italy             False
Japan              True
United Kingdom    False
United States      True
dtype: bool

In [58]:
g7_population[g7_population > 80]

Germany           80.940
Japan            127.061
United States    318.523
dtype: float64

In [60]:
g7_population[(g7_population > 80) | (g7_population <40)]
# Entries that satisfy at least one of the conditions are selected

Canada            35.467
Germany           80.940
Japan            127.061
United States    318.523
dtype: float64

In [61]:
g7_population[(g7_population > 80) & (g7_population < 40)]
#Only entries that satisfy both conditions are selected; no value can be both above 80 and below 40, so the result is empty


Series([], dtype: float64)

Modifying a series

In [62]:
g7_population["Canada"] = 100

In [63]:
g7_population

Canada            100.000
France             63.941
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
dtype: float64

In [None]:
g7_population.iloc[-1] = 200
#Using iloc to modify a value by its position

In [67]:
g7_population

Canada            100.000
France             63.941
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     200.000
dtype: float64

In [None]:
g7_population[(g7_population <= 100)] = 20
#Every entry that satisfies the condition is set to the new value

In [70]:
g7_population

Canada             20.000
France             20.000
Germany            20.000
Italy              20.000
Japan             127.061
United Kingdom     20.000
United States     200.000
dtype: float64

All of the operations above are important in data cleaning.
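Because these assignments overwrite the Series in place, it can help to work on a copy when the original values are still needed. A minimal sketch, assuming a small stand-in Series (the names original, cleaned and masked are illustrative):

```python
import pandas as pd

original = pd.Series({"Canada": 35.467, "Japan": 127.061, "United States": 318.523})

# Work on a copy so the original values stay available.
cleaned = original.copy()
cleaned[cleaned <= 100] = 20

# Series.mask builds the same result without in-place assignment.
masked = original.mask(original <= 100, 20)

print(original["Canada"], cleaned["Canada"], masked["Canada"])
```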

Dataframes