<h1 style="color:springgreen">Pandas</h1>


**Pandas** is a powerful open-source data analysis and manipulation library for Python. It provides data structures like **DataFrame** and **Series**, which make handling large datasets more efficient and intuitive.

### Key Features of Pandas:
- **Load, Filter, and Manipulate Data:**
  - Work with structured data (e.g., CSV, Excel, SQL).
  
- **Data Cleaning and Preprocessing:**
  - Handle missing data, duplicates, or outliers.

- **Aggregating and Grouping Data:**
  - Group data for deeper analysis and insights.

- **Merging and Joining Datasets:**
  - Combine data from multiple sources using merge, join, or concatenate functions.

- **Time Series Analysis and Statistical Computations:**
  - Analyze time-based data and perform various statistical operations.

- **Data Visualization:**
  - Create visualizations by combining Pandas with libraries like Matplotlib or Seaborn.
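
As a rough illustration of the grouping and merging features listed above, here is a minimal sketch. The `sales` and `regions` tables and their values are invented for this example and are not part of any dataset used elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration only
sales = pd.DataFrame({
    "store": ["A", "A", "B", "B"],
    "amount": [100, 150, 200, 75],
})
regions = pd.DataFrame({
    "store": ["A", "B"],
    "region": ["North", "South"],
})

# Grouping: total sales amount per store
totals = sales.groupby("store", as_index=False)["amount"].sum()

# Merging: attach each store's region to its total
report = totals.merge(regions, on="store", how="left")
print(report)
```

A left merge keeps every row of `totals` even if a store had no matching entry in `regions`.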

### Common Uses:
- **Data Wrangling**
- **Exploratory Data Analysis (EDA)**
- **Building Data Pipelines**

Pandas is essential for anyone working in data science or machine learning, especially for tasks like data wrangling and exploratory data analysis.


https://github.com/pandas-dev/pandas

In [17]:
# !pip install pandas

In [29]:
# importing pandas library

import pandas as pd

mydataset = {
    'cars': ["BMW", "Volvo","Ford"],
    'passings': [3, 7, 2]
}

myvar = pd.DataFrame(mydataset)

myvar

    cars  passings
0    BMW         3
1  Volvo         7
2   Ford         2


In [31]:
# checking pandas version 

print(pd.__version__)

2.2.2


### Pandas Series
- **1D Structure**: A one-dimensional labeled array that can store data of any type (integers, floats, strings).

- **Indexing**: Each element has an index (default is 0, 1, 2, …). You can customize the index.

In [107]:
data = [100, 200, 300]
series = pd.Series(data, index=['a', 'b', 'c'])
print(series['a'])  # Output: 100
series

100


a    100
b    200
c    300
dtype: int64

- **Data Types**: A Series can hold mixed data types (integers, floats, strings); in that case its dtype becomes `object`.

In [110]:
pd.Series([1, 'Hello', 3.5])


0        1
1    Hello
2      3.5
dtype: object

- **Vectorized Operations**: Perform element-wise operations without loops.

In [115]:
series + 10  # Adds 10 to each element


a    110
b    210
c    310
dtype: int64

- **Null Values**: Check for missing data with .isnull().

In [120]:
import numpy as np  # needed for np.nan

pd.Series([1, np.nan, 3]).isnull()


0    False
1     True
2    False
dtype: bool

#### Basic Operations:
- **Sum**: `series.sum()`
- **Mean**: `series.mean()`
- **Count non-null values**: `series.count()`
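
The three operations above can be sketched on a small Series. The values are illustrative, and one entry is deliberately missing to show how each method treats `NaN`.

```python
import numpy as np
import pandas as pd

# Illustrative values; np.nan marks a missing entry
series = pd.Series([10, 20, np.nan, 30])

total = series.sum()       # missing values are skipped by default
average = series.mean()    # mean of the three non-null values
non_null = series.count()  # counts only the non-null entries

print(total, average, non_null)
```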


In [125]:
import pandas as pd

data = [10, 20, 30]
series = pd.Series(data)
print(series + 5)  # Output: 15, 25, 35


0    15
1    25
2    35
dtype: int64


### DataFrame

A **DataFrame** in Pandas is a 2D, labeled data structure similar to a table in a database or an Excel spreadsheet. It’s one of the most important and widely used objects in Pandas for data manipulation and analysis.

**Key Features of DataFrame**:

- **2D Structure**: DataFrame is a table-like structure where data is aligned in rows and columns.

- **Indexing**: Like Series, each row has an index, and each column has a label (column name). Both can be customized.

- **Multiple Data Types**: Each column can hold different types of data (integers, floats, strings, etc.).

- **Operations**: You can filter, sort, merge, group, and manipulate data efficiently.
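
A minimal sketch of these features, using a small made-up table (`df_demo` and its row labels `x`, `y`, `z` are hypothetical names chosen for this example):

```python
import pandas as pd

# Hypothetical data with custom row labels instead of the default 0, 1, 2
df_demo = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Charlie"], "Age": [25, 30, 35]},
    index=["x", "y", "z"],
)

print(df_demo.dtypes)     # each column has its own data type
print(df_demo.loc["y"])   # select a row by its label

# Sort rows by a column, largest Age first
oldest_first = df_demo.sort_values("Age", ascending=False)
print(oldest_first)
```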



In [149]:
#creating a DataFrame: from dictionary:)

data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]}
df = pd.DataFrame(data)
df


      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35


In [151]:
# from list:)

data = [[1, 'Apple'], [2, 'Banana'], [3, 'Cherry']]
df2 = pd.DataFrame(data, columns=['ID', 'Fruit'])
df2


   ID   Fruit
0   1   Apple
1   2  Banana
2   3  Cherry


#### Basic Operations

In [154]:
# ACCESS COLUMNS:

print(df['Name'])

0      Alice
1        Bob
2    Charlie
Name: Name, dtype: object


In [156]:
# ACCESS ROWS:)

print(df.iloc[0])

Name    Alice
Age        25
Name: 0, dtype: object


In [158]:
# FILTERING:)

print(df[df['Age'] > 25])  # Rows where Age > 25


      Name  Age
1      Bob   30
2  Charlie   35


In [160]:
#ADD NEW COLUMN:)

df['Salary'] = [50000, 60000, 70000]


In [162]:
df

      Name  Age  Salary
0    Alice   25   50000
1      Bob   30   60000
2  Charlie   35   70000


### Useful Functions:

- **Descriptive Stats**: `df.describe()` – Summary statistics for numerical columns.
- **Null Values**: `df.isnull()` – Check for missing data.
- **Dropping Rows/Columns**: `df.drop()` – Remove rows or columns.
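
These three functions might be used as in the sketch below. The frame `df_demo` is made up for illustration and includes one missing value so that `isnull()` has something to report.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing Age
df_demo = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, np.nan, 35],
})

print(df_demo.describe())      # summary statistics for the numeric column Age
print(df_demo.isnull().sum())  # number of missing values per column

# drop() returns a new DataFrame and leaves df_demo unchanged
without_name = df_demo.drop(columns=["Name"])  # remove a column
without_first = df_demo.drop(index=[0])        # remove a row by index label
```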

In [165]:
# Derive a new column from an existing one (Bonus = 10% of Salary)
df['Salary'] = [50000, 60000, 70000]
df['Bonus'] = df['Salary'] * 0.1
print(df)


      Name  Age  Salary   Bonus
0    Alice   25   50000  5000.0
1      Bob   30   60000  6000.0
2  Charlie   35   70000  7000.0


In [172]:
print(df.loc[[0,1]])

    Name  Age  Salary   Bonus
0  Alice   25   50000  5000.0
1    Bob   30   60000  6000.0


In [196]:
# Load a comma separated file (CSV file) into a DataFrame:

df = pd.read_csv('data.csv')

df

     Duration  Pulse  Maxpulse  Calories
0          60    110       130     409.1
1          60    117       145     479.0
2          60    103       135     340.0
3          45    109       175     282.4
4          45    117       148     406.0
..        ...    ...       ...       ...
164        60    105       140     290.8
165        60    110       145     300.4
166        60    115       145     310.2
167        75    120       150     320.4


In [200]:
df = pd.read_csv('data.csv')

print(df.to_string())    #to print the entire DataFrame.

     Duration  Pulse  Maxpulse  Calories
0          60    110       130     409.1
1          60    117       145     479.0
2          60    103       135     340.0
3          45    109       175     282.4
4          45    117       148     406.0
5          60    102       127     300.5
6          60    110       136     374.0
7          45    104       134     253.3
8          30    109       133     195.1
9          60     98       124     269.0
10         60    103       147     329.3
11         60    100       120     250.7
12         60    106       128     345.3
13         60    104       132     379.3
14         60     98       123     275.0
15         60     98       120     215.2
16         60    100       120     300.0
17         45     90       112       NaN
18         60    103       123     323.0
19         45     97       125     243.0
20         60    108       131     364.2
21         45    100       119     282.0
22         60    130       101     300.0
23         45   

In [205]:
# Check the maximum number of rows pandas will display before truncating the output:

print(pd.options.display.max_rows)

60
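
The display limit can also be changed temporarily rather than globally. The sketch below uses `pd.option_context` for that; the frame `long_df` is an arbitrary example created only to have more rows than the limit.

```python
import pandas as pd

default_rows = pd.get_option("display.max_rows")

# A frame with more rows than we want displayed (values are arbitrary)
long_df = pd.DataFrame({"n": range(100)})

# Inside the block, printing shows at most 10 rows (truncated view)
with pd.option_context("display.max_rows", 10):
    print(long_df)

# Outside the block, the previous setting is back in effect
restored_rows = pd.get_option("display.max_rows")
```

Using a context manager avoids leaving a very large limit in place for the rest of the session.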


In [207]:
pd.options.display.max_rows = 9999

df = pd.read_csv('data.csv')

print(df) 

     Duration  Pulse  Maxpulse  Calories
0          60    110       130     409.1
1          60    117       145     479.0
2          60    103       135     340.0
3          45    109       175     282.4
4          45    117       148     406.0
5          60    102       127     300.5
6          60    110       136     374.0
7          45    104       134     253.3
8          30    109       133     195.1
9          60     98       124     269.0
10         60    103       147     329.3
11         60    100       120     250.7
12         60    106       128     345.3
13         60    104       132     379.3
14         60     98       123     275.0
15         60     98       120     215.2
16         60    100       120     300.0
17         45     90       112       NaN
18         60    103       123     323.0
19         45     97       125     243.0
20         60    108       131     364.2
21         45    100       119     282.0
22         60    130       101     300.0
23         45   

In [209]:
# Load the JSON file into a DataFrame:

df = pd.read_json('data.json')

print(df.to_string())

     Duration  Pulse  Maxpulse  Calories
0          60    110       130     409.1
1          60    117       145     479.0
2          60    103       135     340.0
3          45    109       175     282.4
4          45    117       148     406.0
5          60    102       127     300.5
6          60    110       136     374.0
7          45    104       134     253.3
8          30    109       133     195.1
9          60     98       124     269.0
10         60    103       147     329.3
11         60    100       120     250.7
12         60    106       128     345.3
13         60    104       132     379.3
14         60     98       123     275.0
15         60     98       120     215.2
16         60    100       120     300.0
17         45     90       112       NaN
18         60    103       123     323.0
19         45     97       125     243.0
20         60    108       131     364.2
21         45    100       119     282.0
22         60    130       101     300.0
23         45   

In [211]:
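# JSON and Python dictionaries

# JSON objects look much like Python dictionary literals (key/value pairs in braces),
# although JSON is a text format and must be parsed before it becomes a dictionary.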
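
To make the relationship between JSON and dictionaries concrete, here is a minimal sketch. The string `json_text` is a made-up fragment in the same column-to-rows shape as the data used in this notebook.

```python
import json

import pandas as pd

# Hypothetical JSON text: each column maps row labels to values
json_text = '{"Duration": {"0": 60, "1": 45}, "Pulse": {"0": 110, "1": 109}}'

parsed = json.loads(json_text)  # JSON text -> Python dict
df_demo = pd.DataFrame(parsed)  # dict of dicts -> DataFrame

print(type(parsed).__name__)
print(df_demo)
```

Because the row labels come from JSON keys, they are strings (`"0"`, `"1"`) rather than integers.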

In [216]:
# Load a Python Dictionary into a DataFrame:)

data = {
  "Duration":{
    "0":60,
    "1":60,
    "2":60,
    "3":45,
    "4":45,
    "5":60
  },
  "Pulse":{
    "0":110,
    "1":117,
    "2":103,
    "3":109,
    "4":117,
    "5":102
  },
  "Maxpulse":{
    "0":130,
    "1":145,
    "2":135,
    "3":175,
    "4":148,
    "5":127
  },
  "Calories":{
    "0":409,
    "1":479,
    "2":340,
    "3":282,
    "4":406,
    "5":300
  }
}

df = pd.DataFrame(data)

print(df) 

   Duration  Pulse  Maxpulse  Calories
0        60    110       130       409
1        60    117       145       479
2        60    103       135       340
3        45    109       175       282
4        45    117       148       406
5        60    102       127       300


#### Pandas - Analyzing DataFrames

In [219]:
# Get a quick overview by printing the first 10 rows of the DataFrame:

df = pd.read_csv('data.csv')

print(df.head(10))

   Duration  Pulse  Maxpulse  Calories
0        60    110       130     409.1
1        60    117       145     479.0
2        60    103       135     340.0
3        45    109       175     282.4
4        45    117       148     406.0
5        60    102       127     300.5
6        60    110       136     374.0
7        45    104       134     253.3
8        30    109       133     195.1
9        60     98       124     269.0


In [221]:
print(df.head())

   Duration  Pulse  Maxpulse  Calories
0        60    110       130     409.1
1        60    117       145     479.0
2        60    103       135     340.0
3        45    109       175     282.4
4        45    117       148     406.0


In [223]:
print(df.tail()) 

     Duration  Pulse  Maxpulse  Calories
164        60    105       140     290.8
165        60    110       145     300.4
166        60    115       145     310.2
167        75    120       150     320.4
168        75    125       150     330.4


In [225]:
df.info()  # info() prints its report directly and returns None, so no print() is needed

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 169 entries, 0 to 168
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Duration  169 non-null    int64  
 1   Pulse     169 non-null    int64  
 2   Maxpulse  169 non-null    int64  
 3   Calories  164 non-null    float64
dtypes: float64(1), int64(3)
memory usage: 5.4 KB


In [None]:
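# The RangeIndex line reports the number of rows (169, labeled 0 to 168),
# and the "Data columns" line reports the number of columns (4).

# The table that follows lists each column's name, non-null count, and data type.
# Calories has only 164 non-null values, so 5 rows are missing a Calories entry.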
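
The same facts that `info()` reports can be retrieved programmatically. The sketch below uses a small made-up frame (`df_demo`) standing in for the CSV data, since the file itself is not reproduced here.

```python
import numpy as np
import pandas as pd

# Small made-up frame standing in for the CSV data
df_demo = pd.DataFrame({
    "Duration": [60, 45, 60],
    "Calories": [409.1, np.nan, 340.0],
})

n_rows, n_cols = df_demo.shape    # (rows, columns)
non_null = df_demo.count()        # non-null entries per column
missing = df_demo.isnull().sum()  # missing entries per column

print(n_rows, n_cols)
print(missing)
```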
