# Name: Iman Noor
## Submission Date: 27-06-2024

# **Introduction to Pandas (Series, DataFrame basics)**

## **Pandas**
- Pandas is an `open-source library` built on top of the NumPy library.
- It is a Python package that offers data structures and operations for manipulating numerical data and time series.
- It is popular mainly because it makes importing and analyzing data much easier.
- Pandas is fast and offers high performance and productivity for its users.
- Pandas is well-suited for working with **tabular data**, such as **spreadsheets** or **SQL tables**.

## **What is Python Pandas used for?**
- The Pandas library is widely used in data science, largely because it works well in conjunction with other data science libraries.
- The data produced by Pandas is often used as input for plotting functions in Matplotlib, statistical analysis in SciPy, and machine learning algorithms in Scikit-learn.
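As a minimal sketch of that hand-off, the values of a DataFrame can be exported as a NumPy array with `to_numpy()`, which array-based libraries generally accept. The frame `measurements` and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data, used only for illustration
measurements = pd.DataFrame({"height_cm": [150.0, 160.0, 170.0],
                             "weight_kg": [50.0, 60.0, 70.0]})

# to_numpy() returns the values as a plain NumPy array
features = measurements.to_numpy()

print(type(features))   # a NumPy ndarray
print(features.shape)   # (rows, columns)
```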

## **Creation of Series**

**Series**
- A Series is a `one-dimensional array-like object` containing a sequence of values (of types similar to NumPy types) and an associated array of data labels, called its `index`.
- The simplest Series is formed from only an array of data:

### **Libraries**

In [2]:
import numpy as np
import pandas as pd

## **Q. Create a Pandas Series from a Python list, numpy array, and a dictionary.**

### **List:**

In [3]:
p_list = pd.Series([2,5,8,6])
p_list

0    2
1    5
2    8
3    6
dtype: int64
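The two parts of a Series described earlier, the values and the index, can be inspected separately. A small sketch that rebuilds the same Series:

```python
import pandas as pd

p_list = pd.Series([2, 5, 8, 6])

# The data and the index can be looked at on their own
print(p_list.to_numpy())     # the values as a NumPy array
print(list(p_list.index))    # the default labels, starting at 0
```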

### **NumPy array:**

In [4]:
arr = np.array(['I','M','A','N'])
n_arr = pd.Series(arr)
n_arr

0    I
1    M
2    A
3    N
dtype: object

### **Dictionary**

In [5]:
dct = pd.Series({"Iman":21, "Year":2002})
dct

Iman      21
Year    2002
dtype: int64

## **Q. Assign a custom index to the Series.**

In [6]:
lst = pd.Series([2,9,1,0], index=['d','a','e','b'])
lst

d    2
a    9
e    1
b    0
dtype: int64

In [7]:
d_data = {'Ohio':35000, 'Texas':71000, 'Dakota':80000}
states = ['California', 'Ohio', 'Texas', 'Dakota']
obj = pd.Series(d_data, index=states)
obj

California        NaN
Ohio          35000.0
Texas         71000.0
Dakota        80000.0
dtype: float64
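`California` has no matching key in the dictionary, so its value is `NaN`, and the whole Series becomes `float64` because `NaN` is a floating-point value. A short sketch of how such missing entries can be detected and dropped:

```python
import pandas as pd

d_data = {'Ohio': 35000, 'Texas': 71000, 'Dakota': 80000}
states = ['California', 'Ohio', 'Texas', 'Dakota']
obj = pd.Series(d_data, index=states)

# isna() marks the labels that had no matching key in the dictionary
missing = obj.isna()
print(missing)

# dropna() returns a new Series without the missing entries
print(obj.dropna())
```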

## **Q. Perform basic arithmetic operations on Series.**

In [8]:
s_1 = pd.Series([2,5,6,9])
s_2 = pd.Series([1,2,3,4])
s_3 = s_1+s_2
print("Addition:\n",s_3)
s_5 = s_1-s_2
print("Subtraction:\n",s_5)
s_4 = s_1*s_2
print("Multiplication:\n", s_4)
s_6 = s_1/s_2
print("Division:\n",s_6)

Addition:
 0     3
1     7
2     9
3    13
dtype: int64
Subtraction:
 0    1
1    3
2    3
3    5
dtype: int64
Multiplication:
 0     2
1    10
2    18
3    36
dtype: int64
Division:
 0    2.00
1    2.50
2    2.00
3    2.25
dtype: float64
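The two Series above share the same default index, so the values line up one to one. When the indexes differ, pandas matches values by label rather than by position. A minimal sketch with made-up Series `a` and `b`:

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
b = pd.Series([10, 20, 30], index=['y', 'z', 'w'])

# Values are matched by label; labels present in only one Series give NaN
total = a + b
print(total)

# add() with fill_value treats a missing label as 0 instead
filled = a.add(b, fill_value=0)
print(filled)
```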


## **Elements of a series can be accessed in two ways:**

- Accessing Element from Series with Position
- Accessing Element Using Label (index)

### **Accessing Element from Series with Position**
- To access a Series element by position, pass its position number to the index operator `[ ]`. The position must be an integer; with the default index (0, 1, 2, ...), positions and labels coincide.

## **Q. Access elements using index labels and positions.**

In [9]:
print("Accessing element with Position: ", s_1[2])

Accessing element with Position:  6


In [10]:
print("Accessing 1st three elements with Position: \n", s_1[:3])

Accessing 1st three elements with Position: 
 0    2
1    5
2    6
dtype: int64


In [11]:
print("Accessing last two elements with Position: \n", s_1[-2:])

Accessing last two elements with Position: 
 2    6
3    9
dtype: int64


### **Access an Element in Pandas Using Label**
- To access an element of a Series by label, pass its index label (or a list of labels) to the index operator `[ ]`.
- A Series is like a fixed-size dictionary in that you can get and set values by index label.

In [12]:
lst = pd.Series([3,5,7,9], index=['d','a','e','b'])
print(lst[['d','e']])

d    3
e    7
dtype: int64
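When the index is not the default one, `.loc` (label) and `.iloc` (position) make the intent explicit. A small sketch using the same Series, which also sets a value by label as one would with a dictionary:

```python
import pandas as pd

lst = pd.Series([3, 5, 7, 9], index=['d', 'a', 'e', 'b'])

print(lst.loc['a'])    # access by label
print(lst.iloc[1])     # access by position (the second element)

# Values can also be set by label, as with a dictionary
lst.loc['a'] = 50
print(lst)
```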


## **Q. Filter the Series to include only values greater than a specific threshold.**


In [13]:
ser = pd.Series(np.arange(3,33))
print("Elements greater than 28 (within range(3-33)):\n", ser.loc[lambda x:x>28])

Elements greater than 28 (within range(3-33)):
 26    29
27    30
28    31
29    32
dtype: int32


In [14]:
ser = pd.Series(np.arange(3,33))
print("Elements less than 10 or greater than 30 (within range(3-33)):\n", ser.loc[lambda x:(x<10)|(x>30)])

Elements less than 10 or greater than 30 (within range(3-33)):
 0      3
1      4
2      5
3      6
4      7
5      8
6      9
28    31
29    32
dtype: int32
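The callable passed to `.loc` produces a boolean mask. The same filter can be written by building the mask directly, as in this sketch:

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(3, 33))

# Comparing a Series with a scalar yields a boolean Series (the mask)
mask = ser > 28

# Indexing with the mask keeps only the entries where it is True
above = ser[mask]
print(above)

# Same result as the callable form used with .loc
same = ser.loc[lambda x: x > 28]
print(above.equals(same))
```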


## **Q. Create a DataFrame from a dictionary of lists.**

In [15]:
# Dictionary of lists: each key is a column name, each list holds that column's values
data = {'Bytewise': ['dataframe', 10], 'BWT': ['using', 20], 'G2': ['list', 30]}
df = pd.DataFrame(data)
df

|   | Bytewise | BWT | G2 |
|---|---|---|---|
| 0 | dataframe | using | list |
| 1 | 10 | 20 | 30 |


In [16]:
# changing index
df_2 = pd.DataFrame(data, index=['1','2'])
df_2

|   | Bytewise | BWT | G2 |
|---|---|---|---|
| 1 | dataframe | using | list |
| 2 | 10 | 20 | 30 |
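The same table can be described column by column (a dictionary of lists) or row by row (a list of dictionaries). A brief sketch comparing the two forms; the names `by_column` and `by_row` are illustrative:

```python
import pandas as pd

# Column-oriented: each key is a column, each list holds that column's values
by_column = pd.DataFrame({'Bytewise': ['dataframe', 10],
                          'BWT': ['using', 20],
                          'G2': ['list', 30]})

# Row-oriented: each dictionary is one row
by_row = pd.DataFrame([{'Bytewise': 'dataframe', 'BWT': 'using', 'G2': 'list'},
                       {'Bytewise': 10, 'BWT': 20, 'G2': 30}])

# Both constructions hold the same cells under the same column names
print(by_column.values.tolist() == by_row.values.tolist())
```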


## **Q. Create a DataFrame from a numpy array, specifying column and index names.**

In [17]:
arr = np.array([[1,3],[2,4],[5,7]])
df_n = pd.DataFrame(arr, columns=['A','B'], index=['a','b','c'])
df_n

|   | A | B |
|---|---|---|
| a | 1 | 3 |
| b | 2 | 4 |
| c | 5 | 7 |


## **Q. Load a DataFrame from a CSV file.**

In [18]:
heart_disease = pd.read_csv('heart.csv')
heart_disease

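|   | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 52 | 1 | 0 | 125 | 212 | 0 | 1 | 168 | 0 | 1.0 | 2 | 2 | 3 | 0 |
| 1 | 53 | 1 | 0 | 140 | 203 | 1 | 0 | 155 | 1 | 3.1 | 0 | 0 | 3 | 0 |
| 2 | 70 | 1 | 0 | 145 | 174 | 0 | 1 | 125 | 1 | 2.6 | 0 | 0 | 3 | 0 |
| 3 | 61 | 1 | 0 | 148 | 203 | 0 | 1 | 161 | 0 | 0.0 | 2 | 1 | 3 | 0 |
| 4 | 62 | 0 | 0 | 138 | 294 | 1 | 1 | 106 | 0 | 1.9 | 1 | 3 | 2 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1020 | 59 | 1 | 1 | 140 | 221 | 0 | 1 | 164 | 1 | 0.0 | 2 | 0 | 2 | 1 |
| 1021 | 60 | 1 | 0 | 125 | 258 | 0 | 0 | 141 | 1 | 2.8 | 1 | 1 | 3 | 0 |
| 1022 | 47 | 1 | 0 | 110 | 275 | 0 | 0 | 118 | 1 | 1.0 | 1 | 1 | 2 | 0 |
| 1023 | 50 | 0 | 0 | 110 | 254 | 0 | 0 | 159 | 0 | 0.0 | 2 | 0 | 2 | 1 |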
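`heart.csv` has to exist in the working directory for the cell above to run. For a self-contained sketch, `read_csv` can also read from a file-like object; the CSV text below is a small illustrative sample, not the real file.

```python
import io
import pandas as pd

# Illustrative CSV content standing in for a file on disk
csv_text = """age,sex,chol
52,1,212
53,1,203
70,1,174
"""

# read_csv accepts a path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))
print(sample.shape)
print(sample.dtypes)
```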


## **Q. Display the first and last five rows of the DataFrame.**

In [19]:
print("First 5 rows of dataframe:\n")
heart_disease.head(5)

First 5 rows of dataframe:



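|   | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 52 | 1 | 0 | 125 | 212 | 0 | 1 | 168 | 0 | 1.0 | 2 | 2 | 3 | 0 |
| 1 | 53 | 1 | 0 | 140 | 203 | 1 | 0 | 155 | 1 | 3.1 | 0 | 0 | 3 | 0 |
| 2 | 70 | 1 | 0 | 145 | 174 | 0 | 1 | 125 | 1 | 2.6 | 0 | 0 | 3 | 0 |
| 3 | 61 | 1 | 0 | 148 | 203 | 0 | 1 | 161 | 0 | 0.0 | 2 | 1 | 3 | 0 |
| 4 | 62 | 0 | 0 | 138 | 294 | 1 | 1 | 106 | 0 | 1.9 | 1 | 3 | 2 | 0 |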


In [20]:
print("Last 5 rows of dataframe:\n")
heart_disease.tail(5)

Last 5 rows of dataframe:



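|   | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1020 | 59 | 1 | 1 | 140 | 221 | 0 | 1 | 164 | 1 | 0.0 | 2 | 0 | 2 | 1 |
| 1021 | 60 | 1 | 0 | 125 | 258 | 0 | 0 | 141 | 1 | 2.8 | 1 | 1 | 3 | 0 |
| 1022 | 47 | 1 | 0 | 110 | 275 | 0 | 0 | 118 | 1 | 1.0 | 1 | 1 | 2 | 0 |
| 1023 | 50 | 0 | 0 | 110 | 254 | 0 | 0 | 159 | 0 | 0.0 | 2 | 0 | 2 | 1 |
| 1024 | 54 | 1 | 0 | 120 | 188 | 0 | 1 | 113 | 0 | 1.4 | 1 | 1 | 3 | 0 |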


## **Q. Get a summary of the DataFrame including the mean, median, and standard deviation of numeric columns.**

In [21]:
heart_disease.dtypes

age           int64
sex           int64
cp            int64
trestbps      int64
chol          int64
fbs           int64
restecg       int64
thalach       int64
exang         int64
oldpeak     float64
slope         int64
ca            int64
thal          int64
target        int64
dtype: object

In [22]:
heart_disease.describe()

|   | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 |
| mean | 54.434146 | 0.69561 | 0.942439 | 131.611707 | 246.0 | 0.149268 | 0.529756 | 149.114146 | 0.336585 | 1.071512 | 1.385366 | 0.754146 | 2.323902 | 0.513171 |
| std | 9.07229 | 0.460373 | 1.029641 | 17.516718 | 51.59251 | 0.356527 | 0.527878 | 23.005724 | 0.472772 | 1.175053 | 0.617755 | 1.030798 | 0.62066 | 0.50007 |
| min | 29.0 | 0.0 | 0.0 | 94.0 | 126.0 | 0.0 | 0.0 | 71.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 48.0 | 0.0 | 0.0 | 120.0 | 211.0 | 0.0 | 0.0 | 132.0 | 0.0 | 0.0 | 1.0 | 0.0 | 2.0 | 0.0 |
| 50% | 56.0 | 1.0 | 1.0 | 130.0 | 240.0 | 0.0 | 1.0 | 152.0 | 0.0 | 0.8 | 1.0 | 0.0 | 2.0 | 1.0 |
| 75% | 61.0 | 1.0 | 2.0 | 140.0 | 275.0 | 0.0 | 1.0 | 166.0 | 1.0 | 1.8 | 2.0 | 1.0 | 3.0 | 1.0 |
| max | 77.0 | 1.0 | 3.0 | 200.0 | 564.0 | 1.0 | 2.0 | 202.0 | 1.0 | 6.2 | 2.0 | 4.0 | 3.0 | 1.0 |
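In the `describe()` output, the `50%` row is the median. When only the mean, median, and standard deviation are wanted, `agg()` can compute just those. A sketch on a small made-up frame called `scores`:

```python
import pandas as pd

# Hypothetical numeric data for illustration
scores = pd.DataFrame({'a': [1, 2, 3, 4, 10],
                       'b': [2.0, 4.0, 6.0, 8.0, 10.0]})

# agg() computes only the requested statistics, one row per statistic
summary = scores.agg(['mean', 'median', 'std'])
print(summary)

# The median matches the 50% row reported by describe()
print(scores.describe().loc['50%'])
```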


## **Q. Extract a specific column as a Series.**

In [23]:
h_s = heart_disease['age']  # selecting a single column already returns a Series
h_s

0       52
1       53
2       70
3       61
4       62
        ..
1020    59
1021    60
1022    47
1023    50
1024    54
Name: age, Length: 1025, dtype: int64
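Single brackets with one column name return a Series, while a list of names returns a DataFrame, even when the list has only one entry. A short sketch on a made-up frame called `people`:

```python
import pandas as pd

people = pd.DataFrame({'age': [52, 53, 70], 'chol': [212, 203, 174]})

as_series = people['age']      # single brackets: a Series
as_frame = people[['age']]     # list of names: a one-column DataFrame

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```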

## **Q. Filter rows based on column values.**

In [24]:
# Keep only the columns of interest by passing a list of column names
data = heart_disease[['age', 'chol', 'target']]
data

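|   | age | chol | target |
|---|---|---|---|
| 0 | 52 | 212 | 0 |
| 1 | 53 | 203 | 0 |
| 2 | 70 | 174 | 0 |
| 3 | 61 | 203 | 0 |
| 4 | 62 | 294 | 0 |
| ... | ... | ... | ... |
| 1020 | 59 | 221 | 1 |
| 1021 | 60 | 258 | 0 |
| 1022 | 47 | 275 | 0 |
| 1023 | 50 | 254 | 1 |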
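The cell above narrows the columns. Filtering rows by a column's values uses a boolean mask, as in this minimal sketch on a small made-up frame (`people` and its values are illustrative):

```python
import pandas as pd

people = pd.DataFrame({'age': [52, 34, 70, 45],
                       'chol': [212, 210, 174, 208],
                       'target': [0, 1, 0, 1]})

# Keep the rows whose target equals 1
positives = people[people['target'] == 1]
print(positives)

# isin() keeps the rows whose value appears in a list
selected = people[people['age'].isin([34, 70])]
print(selected)
```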


## **Q. Select rows based on multiple conditions.**

In [25]:
print("Age less than 50 years:\n")
heart_disease[heart_disease['age']<50]

Age less than 50 years:



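|   | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8 | 46 | 1 | 0 | 120 | 249 | 0 | 0 | 144 | 0 | 0.8 | 2 | 0 | 3 | 0 |
| 11 | 43 | 0 | 0 | 132 | 341 | 1 | 0 | 136 | 1 | 3.0 | 1 | 0 | 3 | 0 |
| 12 | 34 | 0 | 1 | 118 | 210 | 0 | 1 | 192 | 0 | 0.7 | 2 | 0 | 2 | 1 |
| 15 | 34 | 0 | 1 | 118 | 210 | 0 | 1 | 192 | 0 | 0.7 | 2 | 0 | 2 | 1 |
| 22 | 45 | 1 | 0 | 104 | 208 | 0 | 0 | 148 | 1 | 3.0 | 1 | 0 | 2 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1012 | 48 | 1 | 1 | 110 | 229 | 0 | 1 | 168 | 0 | 1.0 | 0 | 0 | 3 | 0 |
| 1014 | 44 | 0 | 2 | 108 | 141 | 0 | 1 | 175 | 0 | 0.6 | 1 | 0 | 2 | 1 |
| 1018 | 41 | 1 | 0 | 110 | 172 | 0 | 0 | 158 | 0 | 0.0 | 2 | 0 | 3 | 0 |
| 1019 | 47 | 1 | 0 | 112 | 204 | 0 | 1 | 143 | 0 | 0.1 | 2 | 0 | 2 | 1 |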


In [26]:
print("Age less than 50 years and sex=1:\n")
heart_disease[(heart_disease['age']<50) & (heart_disease['sex']==1)]

Age less than 50 years and sex=1:



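|   | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8 | 46 | 1 | 0 | 120 | 249 | 0 | 0 | 144 | 0 | 0.8 | 2 | 0 | 3 | 0 |
| 22 | 45 | 1 | 0 | 104 | 208 | 0 | 0 | 148 | 1 | 3.0 | 1 | 0 | 2 | 1 |
| 26 | 44 | 1 | 2 | 130 | 233 | 0 | 1 | 179 | 1 | 0.4 | 2 | 0 | 2 | 1 |
| 30 | 44 | 1 | 0 | 120 | 169 | 0 | 1 | 144 | 1 | 2.8 | 0 | 0 | 1 | 0 |
| 35 | 46 | 1 | 2 | 150 | 231 | 0 | 1 | 147 | 0 | 3.6 | 1 | 0 | 2 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1011 | 45 | 1 | 1 | 128 | 308 | 0 | 0 | 170 | 0 | 0.0 | 2 | 0 | 2 | 1 |
| 1012 | 48 | 1 | 1 | 110 | 229 | 0 | 1 | 168 | 0 | 1.0 | 0 | 0 | 3 | 0 |
| 1018 | 41 | 1 | 0 | 110 | 172 | 0 | 0 | 158 | 0 | 0.0 | 2 | 0 | 3 | 0 |
| 1019 | 47 | 1 | 0 | 112 | 204 | 0 | 1 | 143 | 0 | 0.1 | 2 | 0 | 2 | 1 |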
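Each condition is wrapped in parentheses because `&` and `|` bind more tightly than comparison operators in Python. As an alternative sketch, `query()` expresses the same filter as a string; the frame `people` here is made up for illustration.

```python
import pandas as pd

people = pd.DataFrame({'age': [46, 43, 34, 61],
                       'sex': [1, 0, 0, 1]})

# Boolean masks combined with &, each condition in its own parentheses
mask_result = people[(people['age'] < 50) & (people['sex'] == 1)]

# query() expresses the same filter as a string
query_result = people.query('age < 50 and sex == 1')

print(mask_result.equals(query_result))
```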


## **Q. Add a new column to the DataFrame.**

In [27]:
heart_disease['risk_level'] = heart_disease['age'].apply(lambda x: 'High' if x > 50 else 'Low')
heart_disease

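|   | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target | risk_level |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 52 | 1 | 0 | 125 | 212 | 0 | 1 | 168 | 0 | 1.0 | 2 | 2 | 3 | 0 | High |
| 1 | 53 | 1 | 0 | 140 | 203 | 1 | 0 | 155 | 1 | 3.1 | 0 | 0 | 3 | 0 | High |
| 2 | 70 | 1 | 0 | 145 | 174 | 0 | 1 | 125 | 1 | 2.6 | 0 | 0 | 3 | 0 | High |
| 3 | 61 | 1 | 0 | 148 | 203 | 0 | 1 | 161 | 0 | 0.0 | 2 | 1 | 3 | 0 | High |
| 4 | 62 | 0 | 0 | 138 | 294 | 1 | 1 | 106 | 0 | 1.9 | 1 | 3 | 2 | 0 | High |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1020 | 59 | 1 | 1 | 140 | 221 | 0 | 1 | 164 | 1 | 0.0 | 2 | 0 | 2 | 1 | High |
| 1021 | 60 | 1 | 0 | 125 | 258 | 0 | 0 | 141 | 1 | 2.8 | 1 | 1 | 3 | 0 | High |
| 1022 | 47 | 1 | 0 | 110 | 275 | 0 | 0 | 118 | 1 | 1.0 | 1 | 1 | 2 | 0 | Low |
| 1023 | 50 | 0 | 0 | 110 | 254 | 0 | 0 | 159 | 0 | 0.0 | 2 | 0 | 2 | 1 | Low |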
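`apply` calls the lambda once per value. For a simple two-way rule like this one, a vectorized alternative is `np.where`, sketched here on a small made-up frame rather than the heart-disease data:

```python
import numpy as np
import pandas as pd

people = pd.DataFrame({'age': [52, 47, 50, 61]})

# np.where evaluates the condition for the whole column at once
people['risk_level'] = np.where(people['age'] > 50, 'High', 'Low')
print(people)
```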


## **Q. Delete a column from the DataFrame.**

In [28]:
heart_disease = heart_disease.drop(columns=['cp'])
print("Modified DataFrame:\n")
heart_disease

Modified DataFrame:



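|   | age | sex | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target | risk_level |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 52 | 1 | 125 | 212 | 0 | 1 | 168 | 0 | 1.0 | 2 | 2 | 3 | 0 | High |
| 1 | 53 | 1 | 140 | 203 | 1 | 0 | 155 | 1 | 3.1 | 0 | 0 | 3 | 0 | High |
| 2 | 70 | 1 | 145 | 174 | 0 | 1 | 125 | 1 | 2.6 | 0 | 0 | 3 | 0 | High |
| 3 | 61 | 1 | 148 | 203 | 0 | 1 | 161 | 0 | 0.0 | 2 | 1 | 3 | 0 | High |
| 4 | 62 | 0 | 138 | 294 | 1 | 1 | 106 | 0 | 1.9 | 1 | 3 | 2 | 0 | High |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1020 | 59 | 1 | 140 | 221 | 0 | 1 | 164 | 1 | 0.0 | 2 | 0 | 2 | 1 | High |
| 1021 | 60 | 1 | 125 | 258 | 0 | 0 | 141 | 1 | 2.8 | 1 | 1 | 3 | 0 | High |
| 1022 | 47 | 1 | 110 | 275 | 0 | 0 | 118 | 1 | 1.0 | 1 | 1 | 2 | 0 | Low |
| 1023 | 50 | 0 | 110 | 254 | 0 | 0 | 159 | 0 | 0.0 | 2 | 0 | 2 | 1 | Low |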
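`drop()` returns a new DataFrame and leaves the original untouched, which is why the cell above reassigns the result. A small sketch on a made-up frame makes this visible:

```python
import pandas as pd

frame = pd.DataFrame({'age': [52, 53], 'cp': [0, 0], 'chol': [212, 203]})

# drop() returns a new DataFrame; `frame` itself is unchanged
without_cp = frame.drop(columns=['cp'])

print(list(frame.columns))
print(list(without_cp.columns))
```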


## **Q. Rename columns in the DataFrame.**

In [29]:
heart_disease.rename(columns={'age':'AGE', 'sex':'SEX', 'chol':'Cholesterol'}, inplace=True)
heart_disease

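|   | AGE | SEX | trestbps | Cholesterol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target | risk_level |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 52 | 1 | 125 | 212 | 0 | 1 | 168 | 0 | 1.0 | 2 | 2 | 3 | 0 | High |
| 1 | 53 | 1 | 140 | 203 | 1 | 0 | 155 | 1 | 3.1 | 0 | 0 | 3 | 0 | High |
| 2 | 70 | 1 | 145 | 174 | 0 | 1 | 125 | 1 | 2.6 | 0 | 0 | 3 | 0 | High |
| 3 | 61 | 1 | 148 | 203 | 0 | 1 | 161 | 0 | 0.0 | 2 | 1 | 3 | 0 | High |
| 4 | 62 | 0 | 138 | 294 | 1 | 1 | 106 | 0 | 1.9 | 1 | 3 | 2 | 0 | High |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1020 | 59 | 1 | 140 | 221 | 0 | 1 | 164 | 1 | 0.0 | 2 | 0 | 2 | 1 | High |
| 1021 | 60 | 1 | 125 | 258 | 0 | 0 | 141 | 1 | 2.8 | 1 | 1 | 3 | 0 | High |
| 1022 | 47 | 1 | 110 | 275 | 0 | 0 | 118 | 1 | 1.0 | 1 | 1 | 2 | 0 | Low |
| 1023 | 50 | 0 | 110 | 254 | 0 | 0 | 159 | 0 | 0.0 | 2 | 0 | 2 | 1 | Low |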


## **Q. Create a DataFrame with columns Employee, Department, and Salary. Add a new column Bonus which is 10% of the salary for the Engineering department, 5% for HR, and 7% for Marketing.**

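> `Series.apply` parameters:
> - **func:** the function that `.apply` applies to every value of a pandas Series.
> - **convert_dtype:** whether to convert the dtype according to the function's result (deprecated in recent pandas releases).
> - **args=():** additional positional arguments passed to the function after the Series value.
> - **Return type:** a pandas Series holding the results of the function.
>
> The cell below uses `DataFrame.apply` with `axis=1`, which passes each row to the function instead of a single value.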

In [30]:
data = {
    'Employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Department': ['HR', 'Engineering', 'Marketing', 'Engineering', 'HR'],
    'Salary': [60000, 85000, 50000, 90000, 65000]
}
df = pd.DataFrame(data)
df['Bonus'] = df.apply(
    lambda row: row['Salary'] * 0.10 if row['Department'] == 'Engineering' 
    else (row['Salary'] * 0.05 if row['Department'] == 'HR' 
          else row['Salary'] * 0.07),
    axis=1
)
df

|   | Employee | Department | Salary | Bonus |
|---|---|---|---|---|
| 0 | Alice | HR | 60000 | 3000.0 |
| 1 | Bob | Engineering | 85000 | 8500.0 |
| 2 | Charlie | Marketing | 50000 | 3500.0 |
| 3 | David | Engineering | 90000 | 9000.0 |
| 4 | Eve | HR | 65000 | 3250.0 |
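The row-wise `apply` can also be replaced by a lookup table. This sketch, one possible alternative rather than the method used above, maps each department to its rate and multiplies column-wise; the dictionary name `rates` is an assumption.

```python
import pandas as pd

df = pd.DataFrame({
    'Employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Department': ['HR', 'Engineering', 'Marketing', 'Engineering', 'HR'],
    'Salary': [60000, 85000, 50000, 90000, 65000],
})

# Bonus rate per department, as stated in the question
rates = {'Engineering': 0.10, 'HR': 0.05, 'Marketing': 0.07}

# map() looks up each department's rate; the product is computed column-wise
df['Bonus'] = df['Salary'] * df['Department'].map(rates)
print(df)
```

One difference from the nested conditional: a department missing from `rates` would get `NaN` here, whereas the `else` branch above falls back to 7%.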


## **Q. Given a DataFrame sales_data with columns Date, Store, Product, Revenue, create a multi-index DataFrame that shows the total revenue for each Store and Product combination for each month.**

In [31]:
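# Synthetic sales data; the values are random, so the totals differ from run to run
data = {
    'Date': pd.date_range(start='2021-01-01', periods=100, freq='D'),
    'Store': np.random.choice(['Store_A', 'Store_B', 'Store_C'], 100),
    'Product': np.random.choice(['Product_1', 'Product_2', 'Product_3'], 100),
    'Revenue': np.random.uniform(100, 1000, 100)
}
sales_data = pd.DataFrame(data)
# Month of each sale, used as the first grouping key
sales_data['Month'] = sales_data['Date'].dt.to_period('M')
# Total revenue per (Month, Store, Product), flattened to ordinary columns
monthly_revenue = sales_data.groupby(['Month', 'Store', 'Product']).agg({'Revenue': 'sum'}).reset_index()
# Move the three keys back into the index to get a multi-index DataFrame
multi_index_df = monthly_revenue.set_index(['Month', 'Store', 'Product'])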

multi_index_df.head(15)

| Month | Store | Product | Revenue |
|---|---|---|---|
| 2021-01 | Store_A | Product_1 | 1316.918372 |
| 2021-01 | Store_A | Product_2 | 4128.261821 |
| 2021-01 | Store_A | Product_3 | 1143.052311 |
| 2021-01 | Store_B | Product_1 | 2211.725408 |
| 2021-01 | Store_B | Product_2 | 679.064673 |
| 2021-01 | Store_B | Product_3 | 1084.503576 |
| 2021-01 | Store_C | Product_1 | 1672.011642 |
| 2021-01 | Store_C | Product_2 | 3295.636674 |
| 2021-01 | Store_C | Product_3 | 3699.997969 |
| 2021-02 | Store_A | Product_1 | 1785.764052 |
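Once the totals sit under a multi-level index, one combination can be picked with a tuple of labels, and one level can be pivoted into columns with `unstack()`. The sketch below uses a small made-up table with fixed values so the totals are reproducible; `sales`, `totals`, and `wide` are illustrative names.

```python
import pandas as pd

# Small hypothetical sales table with fixed values
sales = pd.DataFrame({
    'Date': pd.to_datetime(['2021-01-05', '2021-01-20', '2021-02-03', '2021-02-10']),
    'Store': ['Store_A', 'Store_A', 'Store_A', 'Store_B'],
    'Product': ['Product_1', 'Product_1', 'Product_2', 'Product_1'],
    'Revenue': [100.0, 150.0, 200.0, 300.0],
})
sales['Month'] = sales['Date'].dt.to_period('M')

# Summing per group yields a Series indexed by (Month, Store, Product)
totals = sales.groupby(['Month', 'Store', 'Product'])['Revenue'].sum()

# Select one combination by giving a label for every index level
january = pd.Period('2021-01', freq='M')
jan_a_p1 = totals.loc[(january, 'Store_A', 'Product_1')]
print(jan_a_p1)

# unstack() moves the Product level into columns, giving a wide table
wide = totals.unstack('Product')
print(wide)
```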


# **The End:)**