# Introduction to Pandas

---

## Theory

Pandas is a powerful open-source Python library for **data manipulation and analysis**.  
It provides two primary data structures:

- **Series**: One-dimensional labeled array (like a column in Excel).
- **DataFrame**: Two-dimensional labeled data structure (like a table in SQL/Excel).

Pandas is built on **NumPy** and integrates well with **Matplotlib, Scikit-learn, and other data science libraries**.  
It is widely used for **cleaning, transforming, and analyzing structured data**.
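
To make the NumPy relationship concrete, here is a minimal sketch. The sample values and the variable names (`ages`, `arr`, `roots`, `shifted`) are assumptions chosen for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values, chosen only for illustration.
ages = pd.Series([24, 27, 22, 32], index=["a", "b", "c", "d"], name="age")

# to_numpy() exposes the values as a NumPy array (the labels are dropped).
arr = ages.to_numpy()
print(type(arr))

# NumPy ufuncs accept a Series and return a Series with the same index.
roots = np.sqrt(ages)
print(roots)

# Arithmetic between two Series aligns on labels, not on position;
# labels present in only one operand produce NaN.
shifted = ages + pd.Series([1, 1], index=["a", "d"])
print(shifted)
```

The label alignment in the last step is the main thing a Series adds on top of a plain NumPy array.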


## Applications

- Reading and writing data in different file formats (CSV, Excel, SQL, JSON, etc.).
- Data cleaning: handling missing values and duplicates.
- Data transformation: filtering, grouping, merging, and reshaping.
- Statistical analysis and summary reports.
- Time-series analysis.
- Integration with **data visualization** libraries.
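
Several of the applications listed above can be chained in a few lines. The sketch below is illustrative only: the CSV text, column names, and variable names are made up, and `io.StringIO` stands in for a real file so the example is self-contained.

```python
import io
import pandas as pd

# Hypothetical CSV content; in practice this would be a file path.
csv_text = """name,city,score
Alice,Paris,81
Bob,London,
Bob,London,
Charlie,Paris,75
"""

# Reading: read_csv accepts a path or any file-like object.
raw = pd.read_csv(io.StringIO(csv_text))

# Cleaning: drop exact duplicate rows, then fill the missing score with 0.
clean = raw.drop_duplicates().fillna({"score": 0})

# Transformation: group by city and take the mean score.
by_city = clean.groupby("city")["score"].mean()
print(by_city)

# Writing: to_csv() returns the CSV text as a string when no path is given.
out_text = clean.to_csv(index=False)
print(out_text)
```

Whether a missing score should be filled with 0, dropped, or left as NaN depends on the data; the fill here is only to show the call.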

---


## List of Functions and Their Usage

Some important Pandas functions:

| Function | Usage |
|----------|--------|
| `pd.Series()` | Create a Series |
| `pd.DataFrame()` | Create a DataFrame |
| `pd.read_csv()` | Load CSV file |
| `pd.read_excel()` | Load Excel file |
| `df.head()`, `df.tail()` | First/last rows (5 by default) |
| `df.info()`, `df.describe()` | Column dtypes and non-null counts / summary statistics |
| `df['col']` | Access a single column (returns a Series) |
| `df[['col1','col2']]` | Access multiple columns (returns a DataFrame) |
| `df.loc[]` | Access rows/columns by label |
| `df.iloc[]` | Access rows/columns by integer position |
| `df.isnull()`, `df.dropna()`, `df.fillna()` | Detect, drop, or fill missing values |
| `df.drop(columns=[])` | Drop columns |
| `df.sort_values()` | Sort rows by column values |
| `df.groupby()` | Grouping and aggregation |
| `df.merge()`, `df.join()`, `pd.concat()` | Combining DataFrames |
| `df.to_csv()` | Save DataFrame to CSV |
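
A few entries in the table are easy to confuse, in particular `loc` versus `iloc`. The sketch below uses a small made-up frame (`df_demo`, with string row labels) so that labels and positions differ; all names and values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data with string row labels so labels and positions differ.
df_demo = pd.DataFrame(
    {"Age": [24, 27, 22], "City": ["New York", "Paris", "London"]},
    index=["alice", "bob", "charlie"],
)

# loc selects by label; a label slice includes both endpoints.
by_label = df_demo.loc["alice":"bob", "Age"]

# iloc selects by integer position; the slice end is excluded.
by_position = df_demo.iloc[0:2, 0]

# Both selections pick the same two rows here.
print(by_label.equals(by_position))

# drop(columns=...) returns a new DataFrame; df_demo itself is unchanged.
ages_only = df_demo.drop(columns=["City"])
print(ages_only)

# pd.concat stacks frames row-wise by default (axis=0).
extra = pd.DataFrame({"Age": [32], "City": ["Tokyo"]}, index=["david"])
combined = pd.concat([df_demo, extra])
print(combined)
```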


In [1]:
# Importing Pandas
import pandas as pd

# Creating a Series
data = [10, 20, 30, 40]
series = pd.Series(data, index=['a','b','c','d'])
print("Series:\n", series)

# Creating a DataFrame
data_dict = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [24, 27, 22, 32],
    'City': ['New York', 'Paris', 'London', 'Tokyo']
}
df = pd.DataFrame(data_dict)
print("\nDataFrame:\n", df)

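# Reading a CSV file (example; "data.csv" is a placeholder path)
# df = pd.read_csv("data.csv")

# Basic DataFrame info
print("\nHead:\n", df.head())
print("\nInfo:\n")
# df.info() prints its report itself and returns None, so wrapping it in
# print() also prints a trailing "None"; calling df.info() alone avoids that.
print(df.info())
print("\nDescribe:\n", df.describe())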

Series:
 a    10
b    20
c    30
d    40
dtype: int64

DataFrame:
       Name  Age      City
0    Alice   24  New York
1      Bob   27     Paris
2  Charlie   22    London
3    David   32     Tokyo

Head:
       Name  Age      City
0    Alice   24  New York
1      Bob   27     Paris
2  Charlie   22    London
3    David   32     Tokyo

Info:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    4 non-null      object
 1   Age     4 non-null      int64 
 2   City    4 non-null      object
dtypes: int64(1), object(2)
memory usage: 228.0+ bytes
None

Describe:
              Age
count   4.000000
mean   26.250000
std     4.349329
min    22.000000
25%    23.500000
50%    25.500000
75%    28.250000
max    32.000000
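
The `describe()` output above lists only `Age` because, by default, a DataFrame with mixed column types is summarized over its numeric columns. As a minimal sketch, passing `include="all"` also summarizes the text columns; the frame is rebuilt here so the block runs on its own, and the variable name `summary` is an assumption.

```python
import pandas as pd

# Same sample data as in the cell above.
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [24, 27, 22, 32],
    'City': ['New York', 'Paris', 'London', 'Tokyo']
})

# include="all" adds count/unique/top/freq rows for the text columns;
# statistics that do not apply to a column are reported as NaN.
summary = df.describe(include="all")
print(summary)

# Individual statistics can be read back by row and column label.
print(summary.loc["mean", "Age"])     # mean of the Age column
print(summary.loc["unique", "City"])  # number of distinct cities
```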


## Examples

In [3]:
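# Example 1: Selecting columns and rows
print("Single Column:\n", df['Name'])
print("\nMultiple Columns:\n", df[['Name', 'Age']])
# iloc selects by integer position, loc by index label; with the default
# RangeIndex the two coincide, so iloc[1] and loc[1] would return the same row.
print("\nRow by index (iloc):\n", df.iloc[1])
print("\nRow by label (loc):\n", df.loc[2])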

# Example 2: Filtering data
print("\nPeople older than 25:\n", df[df['Age'] > 25])

# Example 3: Adding a new column
df['Salary'] = [50000, 60000, 55000, 70000]
print("\nWith Salary Column:\n", df)

# Example 4: Handling missing data
df2 = pd.DataFrame({
    'A': [1, 2, None, 4],
    'B': [5, None, None, 8]
})
print("\nOriginal with NaN:\n", df2)
print("\nDrop NaN:\n", df2.dropna())
print("\nFill NaN with 5:\n", df2.fillna(5))

# Example 5: GroupBy and Aggregation
grouped = df.groupby('City')['Salary'].mean()
print("\nAverage Salary by City:\n", grouped)

# Example 6: Sorting
print("\nSorted by Age:\n", df.sort_values(by='Age'))

# Example 7: Merging DataFrames
df_extra = pd.DataFrame({
    'Name': ['Alice', 'Bob'],
    'Department': ['HR', 'IT']
})
merged = pd.merge(df, df_extra, on='Name', how='left')
print("\nMerged DataFrame:\n", merged)


Single Column:
 0      Alice
1        Bob
2    Charlie
3      David
Name: Name, dtype: object

Multiple Columns:
       Name  Age
0    Alice   24
1      Bob   27
2  Charlie   22
3    David   32

Row by index (iloc):
 Name        Bob
Age          27
City      Paris
Salary    60000
Name: 1, dtype: object

Row by label (loc):
 Name      Charlie
Age            22
City       London
Salary      55000
Name: 2, dtype: object

People older than 25:
     Name  Age   City  Salary
1    Bob   27  Paris   60000
3  David   32  Tokyo   70000

With Salary Column:
       Name  Age      City  Salary
0    Alice   24  New York   50000
1      Bob   27     Paris   60000
2  Charlie   22    London   55000
3    David   32     Tokyo   70000

Original with NaN:
      A    B
0  1.0  5.0
1  2.0  NaN
2  NaN  NaN
3  4.0  8.0

Drop NaN:
      A    B
0  1.0  5.0
3  4.0  8.0

Fill NaN with 5:
      A    B
0  1.0  5.0
1  2.0  5.0
2  5.0  5.0
3  4.0  8.0

Average Salary by City:
 City
London      55000.0
New York    50000