#### Q1. List any five functions of the pandas library with execution.

* Five commonly used functions of the pandas library, each with a short example:
* **1. read_csv()** function: This function reads a CSV file and converts it into a pandas DataFrame.

In [1]:
import pandas as pd
# read a CSV file
df = pd.read_csv('mydata.csv')
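
The cell above only runs if `mydata.csv` exists on disk. As a minimal sketch that runs anywhere, the same call can read from an in-memory text buffer; the column names and values here are invented purely for illustration.

```python
import io
import pandas as pd

# In-memory stand-in for a CSV file (hypothetical columns and values)
csv_text = "name,score\nasha,81\nravi,67\n"

# read_csv accepts a file path or any file-like object
df_csv = pd.read_csv(io.StringIO(csv_text))

print(df_csv.shape)   # rows and columns that were parsed
print(df_csv.dtypes)  # column types inferred from the text
```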

* **2. head()** function: This function returns the first n rows of a DataFrame (5 by default).

In [2]:
import pandas as pd

# create a sample dataframe
data = {'date': ['2022-01-01', '2022-01-01', '2022-01-02', '2022-01-02', '2022-01-03', '2022-01-03'],
        'category': ['A', 'B', 'A', 'B', 'A', 'B'],
        'value': [10, 20, 30, 40, 50, 60]}
df = pd.DataFrame(data)
# display top 5 rows
df.head()

Unnamed: 0,date,category,value
0,2022-01-01,A,10
1,2022-01-01,B,20
2,2022-01-02,A,30
3,2022-01-02,B,40
4,2022-01-03,A,50


* **3. describe()** function: This function provides descriptive statistics for the numeric columns of a DataFrame by default: count, mean, standard deviation, minimum and maximum values, and quartiles.

In [3]:
df.describe()

Unnamed: 0,value
count,6.0
mean,35.0
std,18.708287
min,10.0
25%,22.5
50%,35.0
75%,47.5
max,60.0


* **4. groupby()** function: This function splits a DataFrame into groups by one or more columns so that an aggregation or other function can be applied to each group.

In [4]:
# group the DataFrame by the 'category' column and get the mean of the 'value' column for each group
df.groupby('category')['value'].mean()

category
A    30.0
B    40.0
Name: value, dtype: float64
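
A grouped column is not limited to a single statistic. The sketch below rebuilds the same sample data under a separate name (`sales`, chosen here for illustration) and uses `agg` to compute several aggregates per group in one call.

```python
import pandas as pd

# Same category/value data as the example above
sales = pd.DataFrame({'category': ['A', 'B', 'A', 'B', 'A', 'B'],
                      'value': [10, 20, 30, 40, 50, 60]})

# One row per category, one column per aggregate
summary = sales.groupby('category')['value'].agg(['count', 'sum', 'mean'])
print(summary)
```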

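* **5. pivot_table()** function: This function creates a spreadsheet-style pivot table from a DataFrame, aggregating the values (mean by default) for each combination of row and column keys.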

In [5]:
# create a pivot table from the DataFrame, with 'category' as rows, 'date' as columns, and 'value' as values
df.pivot_table(index='category', columns='date', values='value')

date      2022-01-01  2022-01-02  2022-01-03
category
A                 10          30          50
B                 20          40          60
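
In the example above every category/date pair occurs exactly once, so the aggregation step is invisible. A small sketch with a repeated pair (the `orders` data is made up for illustration) shows how `aggfunc` and `fill_value` change the result.

```python
import pandas as pd

# Category 'A' appears twice on 2022-01-01; 'B' has no row on 2022-01-02
orders = pd.DataFrame({'date': ['2022-01-01', '2022-01-01', '2022-01-01', '2022-01-02'],
                       'category': ['A', 'A', 'B', 'A'],
                       'value': [10, 30, 20, 50]})

# Default aggregation is the mean; missing combinations become NaN
mean_table = orders.pivot_table(index='category', columns='date', values='value')

# Sum instead of mean, and show 0 for missing combinations
sum_table = orders.pivot_table(index='category', columns='date', values='value',
                               aggfunc='sum', fill_value=0)
print(mean_table)
print(sum_table)
```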


# -------------------------------------------------

#### Q2. Given a Pandas DataFrame df with columns 'A', 'B', and 'C', write a Python function to re-index the DataFrame with a new index that starts from 1 and increments by 2 for each row.

In [6]:
import pandas as pd
def reindex_df(df):
    # Relabel the rows 1, 3, 5, ... instead of calling df.reindex(), which would
    # look the new labels up in the old index and fill the misses with NaN
    result = df.copy()
    result.index = pd.RangeIndex(start=1, stop=len(df) * 2, step=2)
    return result


In [7]:
# create an example DataFrame
df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8], 'C': [9, 10, 11, 12]})
print("Dataframe before re-index")
print(df)
print("-------------------------")

# re-index the DataFrame
df = reindex_df(df)
print("Dataframe after re-index")
print(df)

Dataframe before re-index
   A  B   C
0  1  5   9
1  2  6  10
2  3  7  11
3  4  8  12
-------------------------
Dataframe after re-index
   A  B   C
1  1  5   9
3  2  6  10
5  3  7  11
7  4  8  12
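
A common pitfall when relabelling rows is that `reindex` aligns on the existing labels rather than renaming them. The sketch below contrasts it with `set_axis`, which only swaps the labels; the variable names are illustrative.

```python
import pandas as pd

frame = pd.DataFrame({'A': [1, 2, 3, 4]})
odd_labels = pd.RangeIndex(start=1, stop=2 * len(frame), step=2)  # 1, 3, 5, 7

# reindex looks the new labels up among the old ones (0..3):
# labels 1 and 3 exist, 5 and 7 do not, so those rows are filled with NaN
aligned = frame.reindex(odd_labels)

# set_axis keeps every row in its original order and only replaces the labels
relabelled = frame.set_axis(odd_labels, axis=0)

print(aligned)
print(relabelled)
```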


# ---------------------------------------------------

#### Q3. You have a Pandas DataFrame df with a column named 'Values'. Write a Python function that iterates over the DataFrame and calculates the sum of the first three values in the 'Values' column. The function should print the sum to the console.

In [8]:
import pandas as pd

def sum_df(df):
    total = 0  # avoid shadowing the built-in sum()
    # Count rows by position so the result does not depend on the index labels
    for position, (_, row) in enumerate(df.iterrows()):
        if position == 3:
            break
        total += row['Values']
    print("The sum of the first 3 values is:", total)


In [9]:
df = pd.DataFrame()
df['Values']= [10, 20, 30, 40, 50]
print(df)
print("-------------------------")
sum_df(df)

   Values
0      10
1      20
2      30
3      40
4      50
-------------------------
The sum of the first 3 values is: 60
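
The question asks for explicit iteration, but the same total can be computed without a loop. A minimal sketch using positional slicing, with a non-numeric index chosen here to show that the result does not depend on the row labels:

```python
import pandas as pd

# Illustrative data with string labels instead of 0, 1, 2, ...
values_df = pd.DataFrame({'Values': [10, 20, 30, 40, 50]},
                         index=['a', 'b', 'c', 'd', 'e'])

# iloc[:3] selects the first three rows by position
first_three_total = values_df['Values'].iloc[:3].sum()
print("Sum of the first three values:", first_three_total)
```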


# ------------------------------------------------------

#### Q4. Given a Pandas DataFrame df with a column 'Text', write a Python function to create a new column 'Word_Count' that contains the number of words in each row of the 'Text' column.

In [10]:
import pandas as pd
df = pd.DataFrame()
df['Text']= ["Given a Pandas DataFrame df with a column 'Text'","write a Python function to create a new column 'Word_Count'","that contains the number of words in each row of the 'Text' column"]
df

Unnamed: 0,Text
0,Given a Pandas DataFrame df with a column 'Text'
1,write a Python function to create a new column...
2,that contains the number of words in each row ...


In [11]:
def add_word_count(df):
    # Split on any run of whitespace and count the words in each row
    df['Word_Count'] = df['Text'].apply(lambda x: len(str(x).split()))
    return df

add_word_count(df)

Unnamed: 0,Text,Word_Count
0,Given a Pandas DataFrame df with a column 'Text',9
1,write a Python function to create a new column...,10
2,that contains the number of words in each row ...,13
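
Word counts are sensitive to how the text is split. The sketch below, on made-up strings, compares splitting on a single space with the vectorized `.str.split()` accessor, which splits on runs of whitespace and leaves missing values as NaN.

```python
import pandas as pd

# Illustrative rows: a double space, a leading space, and a missing value
texts = pd.DataFrame({'Text': ['two  spaces here', ' leading space', None]})

# split(" ") treats every single space as a separator, so empty strings are
# counted as words, and str(x) turns the missing value into a one-word string
naive = texts['Text'].apply(lambda x: len(str(x).split(" ")))

# .str.split() with no argument splits on whitespace runs and propagates NaN
robust = texts['Text'].str.split().str.len()

print(naive.tolist())
print(robust.tolist())
```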


# ----------------------------------------------------------

#### Q5. How are DataFrame.size() and DataFrame.shape() different?

* `DataFrame.size` and `DataFrame.shape` both describe the dimensions of a pandas DataFrame, but they return different things. Both are attributes rather than methods, so they are accessed without parentheses; calling `df.size()` or `df.shape()` raises a `TypeError`.

* **DataFrame.size** returns the total number of elements in a DataFrame, that is, the number of rows multiplied by the number of columns, as a single integer.
* **DataFrame.shape** returns a tuple of two integers representing the dimensions of the DataFrame. The first element of the tuple is the number of rows, and the second element is the number of columns.
---
##### Example

In [12]:
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9], 'D': [10, 11, 12]})

print("The size of the dataframe is : ",df.size)
print("The shape of the dataframe is : ",df.shape)

The size of the dataframe is :  12
The shape of the dataframe is :  (3, 4)
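
The relationship between the two can be checked directly. A short sketch (the name `grid` is illustrative) that also shows what happens when `size` is called like a method:

```python
import pandas as pd

grid = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9], 'D': [10, 11, 12]})

rows, cols = grid.shape           # shape unpacks into (rows, columns)
print(rows * cols == grid.size)   # size is the product of the two

# size is a plain integer attribute, so calling it fails
call_failed = False
try:
    grid.size()
except TypeError as exc:
    call_failed = True
    print("size is not callable:", exc)
```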


# ------------------------------------------------------------

#### Q6. Which function of pandas do we use to read an excel file?

* The **read_excel()** function reads data from an Excel file into a pandas DataFrame. Reading `.xlsx` files requires an optional Excel engine such as `openpyxl` to be installed.
---
##### Example

In [14]:
import pandas as pd
df = pd.read_excel("sample.xlsx")

# ------------------------------------------------------------

#### Q7. You have a Pandas DataFrame df that contains a column named 'Email' that contains email addresses in the format 'username@domain.com'. Write a Python function that creates a new column 'Username' in df that contains only the username part of each email address. The username is the part of the email address that appears before the '@' symbol. For example, if the email address is 'john.doe@example.com', the 'Username' column should contain 'john.doe'. Your function should extract the username from each email address and store it in the new 'Username' column.

In [15]:
import pandas as pd

def extract_username(df):
    df["Username"] = df['Email'].apply(lambda x:x.split("@")[0])
    print("The dataframe after username extraction...")
    print("--------------------------------------")
    print(df)


In [16]:
email = {"Email":["madhan.kumar@gmail.com","priya@yahoo.com","elakiya.madhan@gmail.com","sandhiya@hotmail.com"]}
df = pd.DataFrame(email)
print("The original dataframe...")
print("--------------------------------------")
print(df)
print("--------------------------------------")
extract_username(df)

The original dataframe...
--------------------------------------
                      Email
0    madhan.kumar@gmail.com
1           priya@yahoo.com
2  elakiya.madhan@gmail.com
3      sandhiya@hotmail.com
--------------------------------------
The dataframe after username extraction...
--------------------------------------
                      Email        Username
0    madhan.kumar@gmail.com    madhan.kumar
1           priya@yahoo.com           priya
2  elakiya.madhan@gmail.com  elakiya.madhan
3      sandhiya@hotmail.com        sandhiya
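
The same extraction can be written with the vectorized `.str` accessor instead of `apply`. A minimal sketch on made-up addresses, including a missing value, which the accessor passes through as NaN whereas a lambda that calls `x.split` would raise an error on it:

```python
import pandas as pd

# Illustrative addresses; the last entry is missing on purpose
contacts = pd.DataFrame({'Email': ['john.doe@example.com', 'priya@example.org', None]})

# Split each address at '@' and keep the first piece
contacts['Username'] = contacts['Email'].str.split('@').str[0]
print(contacts)
```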


# ----------------------------------------------------------

#### Q8. You have a Pandas DataFrame df with columns 'A', 'B', and 'C'. Write a Python function that selects all rows where the value in column 'A' is greater than 5 and the value in column 'B' is less than 10. The function should return a new DataFrame that contains only the selected rows.

In [17]:
import pandas as pd
def filter_df(df):
    new_df = df[(df["A"] > 5) & (df["B"] < 10)]
    return new_df

In [18]:
from random import randint
## creating a dataframe with random numbers between 1 and 12
lst = {"A":[randint(1,12) for i in range(5)],"B":[randint(1,12) for i in range(5)],"C":[randint(1,12) for i in range(5)]}
df = pd.DataFrame(lst)
df

Unnamed: 0,A,B,C
0,11,3,7
1,7,10,2
2,12,6,7
3,3,5,9
4,1,5,11


In [19]:
# After filter
filter_df(df)

Unnamed: 0,A,B,C
0,11,3,7
2,12,6,7
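
Because the data above is random, the rows selected change on every run. The sketch below fixes the values to the ones displayed above and also shows `DataFrame.query` as an alternative way to write the same filter.

```python
import pandas as pd

# Fixed copy of the values shown above, so the result is reproducible
numbers = pd.DataFrame({'A': [11, 7, 12, 3, 1],
                        'B': [3, 10, 6, 5, 5],
                        'C': [7, 2, 7, 9, 11]})

# Boolean mask: each condition needs its own parentheses when combined with &
mask_result = numbers[(numbers['A'] > 5) & (numbers['B'] < 10)]

# Equivalent filter written as a query string
query_result = numbers.query('A > 5 and B < 10')

print(query_result)
```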


# -----------------------------------------------

#### Q9. Given a Pandas DataFrame df with a column 'Values', write a Python function to calculate the mean, median, and standard deviation of the values in the 'Values' column.

In [20]:
def calc_df(df):
    print("Mean : ",df['Values'].mean())
    print("Median : ", df['Values'].median())
    print("STD : ", df['Values'].std())

In [21]:
import pandas as pd
from random import randint
# Create a sample DataFrame with random values in the 'Values' column
df = pd.DataFrame({'Values': [randint(1, 10) for i in range(10)]})
df

Unnamed: 0,Values
0,2
1,2
2,2
3,8
4,9
5,10
6,4
7,9
8,3
9,10


In [22]:
calc_df(df)

Mean :  5.9
Median :  6.0
STD :  3.573047252229764
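
If the statistics are needed as values rather than printed text, `agg` can return all three at once. A sketch using the same ten values shown above; note that pandas computes the sample standard deviation (`ddof=1`) by default.

```python
import pandas as pd

# The values displayed in the sample DataFrame above
scores = pd.DataFrame({'Values': [2, 2, 2, 8, 9, 10, 4, 9, 3, 10]})

# Returns a Series indexed by the statistic names
stats = scores['Values'].agg(['mean', 'median', 'std'])
print(stats)
```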


# ------------------------------------------

#### Q10. Given a Pandas DataFrame df with a column 'Sales' and a column 'Date', write a Python function to create a new column 'MovingAverage' that contains the moving average of the sales for the past 7 days for each row in the DataFrame. The moving average should be calculated using a window of size 7 and should include the current day.

In [23]:
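def moving_average(df):
    # rolling(window=7) spans 7 rows, so this assumes one row per day sorted by 'Date';
    # min_periods=1 averages whatever is available for the first six rows
    rolling_mean = df['Sales'].rolling(window=7, min_periods=1).mean()
    df['MovingAverage'] = rolling_mean
    return df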

In [24]:
import pandas as pd
from random import randint

# Create a sample DataFrame with random sales data and dates for the past 10 days
dates = pd.date_range(start='2022-01-01', periods=10, freq='D')
sales = [randint(0, 100) for i in range(10)]
df = pd.DataFrame({'Date': dates, 'Sales': sales})
df


Unnamed: 0,Date,Sales
0,2022-01-01,29
1,2022-01-02,7
2,2022-01-03,78
3,2022-01-04,74
4,2022-01-05,23
5,2022-01-06,21
6,2022-01-07,57
7,2022-01-08,91
8,2022-01-09,54
9,2022-01-10,18


In [25]:
# calculate moving average
moving_average(df)

Unnamed: 0,Date,Sales,MovingAverage
0,2022-01-01,29,29.0
1,2022-01-02,7,18.0
2,2022-01-03,78,38.0
3,2022-01-04,74,47.0
4,2022-01-05,23,42.2
5,2022-01-06,21,38.666667
6,2022-01-07,57,41.285714
7,2022-01-08,91,50.142857
8,2022-01-09,54,56.857143
9,2022-01-10,18,48.285714
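
A window of 7 rows equals 7 days only when there is exactly one row per day. As a sketch of the alternative, pandas also accepts a time offset such as `'7D'` together with `on='Date'`; the data below is made up, with a gap in the dates to make the difference visible.

```python
import pandas as pd

# Illustrative data with a one-week gap between the second and third rows
sales_log = pd.DataFrame({
    'Date': pd.to_datetime(['2022-01-01', '2022-01-02', '2022-01-09', '2022-01-10']),
    'Sales': [10, 20, 30, 40],
})
sales_log = sales_log.sort_values('Date')  # time-based windows need sorted dates

# Row-based: the last 7 rows, regardless of how far apart they are in time
sales_log['RowWindow'] = sales_log['Sales'].rolling(window=7, min_periods=1).mean()

# Time-based: only rows within the 7 days ending at the current row's date
sales_log['DayWindow'] = sales_log.rolling('7D', on='Date')['Sales'].mean()

print(sales_log)
```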


# ----------------------------------------------

#### Q11. You have a Pandas DataFrame df with a column 'Date'. Write a Python function that creates a new column 'Weekday' in the DataFrame. The 'Weekday' column should contain the weekday name (e.g. Monday, Tuesday) corresponding to each date in the 'Date' column.

In [26]:
def get_weekday(df):
    df['Weekday'] = df["Date"].dt.day_name()
    return df

In [27]:
import pandas as pd
dates = pd.date_range(start='2022-01-01', periods=10, freq='D')
df = pd.DataFrame({"Date":dates})
df

Unnamed: 0,Date
0,2022-01-01
1,2022-01-02
2,2022-01-03
3,2022-01-04
4,2022-01-05
5,2022-01-06
6,2022-01-07
7,2022-01-08
8,2022-01-09
9,2022-01-10


In [28]:
get_weekday(df)

Unnamed: 0,Date,Weekday
0,2022-01-01,Saturday
1,2022-01-02,Sunday
2,2022-01-03,Monday
3,2022-01-04,Tuesday
4,2022-01-05,Wednesday
5,2022-01-06,Thursday
6,2022-01-07,Friday
7,2022-01-08,Saturday
8,2022-01-09,Sunday
9,2022-01-10,Monday
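
The `.dt` accessor only works on datetime columns. If the dates arrive as strings, for example from a CSV file, they need to be converted first. A minimal sketch with illustrative names:

```python
import pandas as pd

# Dates stored as plain strings
events = pd.DataFrame({'Date': ['2022-01-01', '2022-01-02', '2022-01-03']})

# Convert to datetime before using the .dt accessor
events['Date'] = pd.to_datetime(events['Date'])

events['Weekday'] = events['Date'].dt.day_name()
events['WeekdayNumber'] = events['Date'].dt.dayofweek  # Monday is 0, Sunday is 6

print(events)
```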


# -------------------------------------------------

#### Q12. Given a Pandas DataFrame df with a column 'Date' that contains timestamps, write a Python function to select all rows where the date is between '2023-01-01' and '2023-01-31'.

In [29]:
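def get_jan_dates(df):
    start_date = pd.Timestamp('2023-01-01')
    # Use an exclusive upper bound: Timestamp('2023-01-31') is midnight, so
    # "<= end_date" would drop timestamps later on 31 January
    end_date = pd.Timestamp('2023-02-01')
    jan_df = df[(df['Date'] >= start_date) & (df['Date'] < end_date)]
    return jan_df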

In [30]:
from datetime import datetime,timedelta
from random import randint
## creating a dataframe with random dates
dates = [datetime.today() + timedelta(days = randint(-60,60)) for i in range(50)]
df = pd.DataFrame({"Date":dates})
df

Unnamed: 0,Date
0,2023-02-06 12:01:28.809136
1,2023-01-30 12:01:28.809136
2,2023-01-13 12:01:28.809136
3,2023-04-03 12:01:28.809136
4,2023-01-21 12:01:28.809136
5,2023-04-22 12:01:28.809136
6,2023-01-20 12:01:28.809136
7,2023-04-21 12:01:28.809136
8,2023-02-08 12:01:28.809136
9,2023-01-19 12:01:28.809136


In [31]:
## filter df
get_jan_dates(df)

Unnamed: 0,Date
1,2023-01-30 12:01:28.809136
2,2023-01-13 12:01:28.809136
4,2023-01-21 12:01:28.809136
6,2023-01-20 12:01:28.809136
9,2023-01-19 12:01:28.809136
10,2023-01-29 12:01:28.809136
13,2023-01-24 12:01:28.809136
14,2023-01-16 12:01:28.809136
15,2023-01-10 12:01:28.809136
16,2023-01-16 12:01:28.809136
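
One subtlety with timestamp ranges is that `pd.Timestamp('2023-01-31')` means midnight at the start of that day. The sketch below, on a few hand-picked timestamps, compares a closed range ending on that value with a half-open range that ends at the start of February.

```python
import pandas as pd

# Illustrative timestamps around the edges of January 2023
stamps = pd.DataFrame({'Date': pd.to_datetime([
    '2022-12-31 23:59:59',
    '2023-01-01 00:00:00',
    '2023-01-31 12:01:28',
    '2023-02-01 00:00:00',
])})

start = pd.Timestamp('2023-01-01')
next_month = pd.Timestamp('2023-02-01')

# Half-open range [start, next_month) keeps every timestamp in January
january = stamps[(stamps['Date'] >= start) & (stamps['Date'] < next_month)]

# A closed range ending at midnight on 31 January misses the midday row
closed = stamps[stamps['Date'].between(start, pd.Timestamp('2023-01-31'))]

print(january)
print(closed)
```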


# -----------------------------------------------

#### Q13. To use the basic functions of pandas, what is the first and foremost necessary library that needs to be imported?

* The **pandas** library itself must be imported before any of its functions can be used. By convention it is imported under the alias `pd`.

In [32]:
import pandas as pd