<a href="https://colab.research.google.com/github/poudyaldiksha/Data-Science-project/blob/main/Lesson_16_b2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lesson 16: Data Analysis with Pandas

## **Introduction to Data Analysis**

Before diving into the technicalities of data analysis, it's important to appreciate the value and impact of understanding data. Data analysis is the process of examining, cleaning, transforming, and modeling data to discover useful information, draw conclusions, and support decision-making.

Imagine you have access to data on monthly sales for every shop in your city. By analyzing this data, you might notice that during festivals, the sales of sweets, clothes, jewelry, and electronic products spike significantly. This observation can help businesses plan their inventory and marketing strategies better to capitalize on these peak periods.

Similarly, if you analyze tourism data, you'll see that many families in India go on vacation in May and June. This trend aligns with the school summer vacations, and understanding this can help travel agencies and hotels plan their services accordingly.

Data analysis allows us to uncover trends, patterns, and insights that can inform decisions in daily life, business, healthcare, engineering, and numerous other fields.

## **Why Pandas for Data Analysis?**

When it comes to data analysis in Python, one of the most powerful and widely used libraries is Pandas. But why Pandas? Let's delve into its background and benefits:

**1. History and Development:**

- Pandas was developed by **Wes McKinney** in **2008** while he was working at AQR Capital Management. He needed a high-performance, flexible tool to perform quantitative analysis on financial data. The name "Pandas" is derived from the term **"panel data,"** an econometrics term for data sets that include observations over multiple time periods for the same individuals.
- Since its inception, Pandas has grown rapidly in popularity and capability, becoming an essential tool for data scientists and analysts around the world.

**2. Key Features and Advantages:**



- **Ease of Use:** Pandas provides easy-to-use data structures and data analysis tools that make data manipulation simple and intuitive.
- **Flexible Data Structures:** It introduces two main data structures, **Series** and **DataFrame**, which handle different types of data effectively.
- **Data Cleaning and Preparation:** With Pandas, you can clean and prepare your data effortlessly, handling missing data, merging datasets, and reshaping data.
- **Powerful Grouping and Aggregation:** Pandas allows for powerful data grouping, aggregation, and transformation operations, enabling complex data analysis tasks.
- **Integration with Other Libraries:** Pandas integrates seamlessly with other popular Python libraries like NumPy, Matplotlib, and SciPy, enhancing its capabilities.
- **Performance:** Built on top of NumPy, Pandas provides high-performance, fast, and efficient data manipulation.

## **Introduction to Pandas Series**

In this lesson, we'll start by learning about the Pandas Series, one of the core data structures in Pandas.

**What is a Pandas Series?**

A Pandas Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floats, Python objects, etc.). It is similar to a column in a spreadsheet or a database table. Because every item carries an index label, you can look items up by label as well as by position, which a regular Python list does not offer.





### Activity 1: Creating a Pandas Series

Let's create a simple Pandas Series to understand its structure and features.

Suppose you have a collection of 20 books, and each book has between 150 and 500 pages (both inclusive).

We can create a Pandas Series containing the number of pages in each book by first creating a Python list and then converting it to a Pandas Series. To create a Pandas Series, you first need to import the Pandas module using the `import` keyword.

`import pandas as pd`

Here, `pd` is an alias (or nickname) for `pandas`.

**Note:** `Series` is a class, so unlike most functions, `Series()` begins with the uppercase letter S.

In [None]:
import random  # needed for random.randint() in the next cell

import numpy as np
import pandas as pd

In [None]:
page_numbers = [ random.randint(150,500) for i in range(20) ]
page_numbers

[304,
 216,
 285,
 248,
 417,
 455,
 159,
 318,
 237,
 482,
 173,
 390,
 365,
 369,
 394,
 400,
 362,
 309,
 393,
 321]

In [None]:
p_number_series = pd.Series(page_numbers)
p_number_series

0     304
1     216
2     285
3     248
4     417
5     455
6     159
7     318
8     237
9     482
10    173
11    390
12    365
13    369
14    394
15    400
16    362
17    309
18    393
19    321
dtype: int64


In [None]:
len(page_numbers)

20

In [None]:
data = [
    ['Alice', 24, 'New York'],
    ['Bob', 27, 'Los Angeles'],
    ['Charlie', 22, 'Chicago'],
    ['David', 32, 'Houston']
]

data_series = pd.Series(data)
data_series

Unnamed: 0,0
0,"[Alice, 24, New York]"
1,"[Bob, 27, Los Angeles]"
2,"[Charlie, 22, Chicago]"
3,"[David, 32, Houston]"


In [None]:
data = [
    ['Alice', 24, 'New York'],
    ['Bob', 27, 'Los Angeles'],
    ['Charlie', 22, 'Chicago'],
    ['David', 32, 'Houston']
]

data_f = pd.DataFrame(data)
data_f

Unnamed: 0,0,1,2
0,Alice,24,New York
1,Bob,27,Los Angeles
2,Charlie,22,Chicago
3,David,32,Houston


In [None]:
#Activity
import pandas as pd
import random
# Create a list of page numbers for 20 books
page_numbers = [ random.randint(150,500) for i in range(20) ]
#print(page_numbers)

# Convert the list to a Pandas Series
book_pages = pd.Series(page_numbers)
book_pages


0     469
1     188
2     450
3     399
4     156
5     348
6     207
7     257
8     182
9     197
10    308
11    267
12    164
13    393
14    411
15    451
16    439
17    334
18    203
19    412
dtype: int64


In [None]:
print(type(book_pages))

<class 'pandas.core.series.Series'>


The first column in the output shows the index label of each item in the `book_pages` Pandas Series. The second column shows the number of pages in the corresponding book. Every item is an integer (`int64` in the outputs shown in this lesson).

In [None]:
#Example
x= pd.Series([1,2,3,4], index=(1,2,3,4))
x

Unnamed: 0,0
1,1
2,2
3,3
4,4


In [None]:
x= pd.Series([1,2,3,4])
x

Unnamed: 0,0
0,1
1,2
2,3
3,4


In [None]:
#Example

labels = ['a','b','c']
my_data = [10,20,30]


print(pd.Series(my_data))
print('==================')
print(pd.Series(my_data,index=labels))

0    10
1    20
2    30
dtype: int64
a    10
b    20
c    30
dtype: int64


In [None]:
#Example
roll_num =[ 748847, 887283,889832]
names=["rich","courage","shine"]
series1= pd.Series(roll_num,index=names)
series1

Unnamed: 0,0
rich,748847
courage,887283
shine,889832


In [None]:
# series of numbers from 11 to 20
ser = pd.Series(data = range(11,21),index=range(11,21))
ser

Unnamed: 0,0
11,11
12,12
13,13
14,14
15,15
16,16
17,17
18,18
19,19
20,20


In [None]:
# example
series2 = pd.Series([1,2,3,4,5], index = ['a', 'b', 'c', 'd', 'e'], dtype = float)
print(series2)

a    1.0
b    2.0
c    3.0
d    4.0
e    5.0
dtype: float64


### Activity 2: Exploring the Pandas Series

A Pandas Series can also contain items of multiple data types. Let's store the title of a book, its number of pages, the price, and whether it's available in paperback.

In [None]:
# Create a Pandas Series with mixed data types
book_details = pd.Series(["the book",182,288.99,True])
book_details

Unnamed: 0,0
0,the book
1,182
2,288.99
3,True


In [None]:
print(type(book_details))

<class 'pandas.core.series.Series'>


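**Note:**

`type()` reports that `book_details` is a Series, but the Series itself has the data type (dtype) `object`. A Series has a single dtype for all of its items, so when the items are of different types, Pandas stores them as generic Python objects and reports `object` as the common data type.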
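
The note above can be checked directly. The following is a minimal sketch that rebuilds the same `book_details` Series and compares the dtype of the Series with the Python type of each item; the name `item_types` is introduced here only for illustration.

```python
import pandas as pd

# Mixed values: a string, an int, a float, and a bool
book_details = pd.Series(["the book", 182, 288.99, True])

# The Series as a whole reports one dtype
print(book_details.dtype)   # object

# Each stored item still keeps its own Python type
item_types = [type(item).__name__ for item in book_details]
print(item_types)           # ['str', 'int', 'float', 'bool']
```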

**Creating Series from Other Containers and Creating Empty Series**

In [None]:
book_details = pd.Series(("the book",182,288.99,True))
book_details

Unnamed: 0,0
0,the book
1,182
2,288.99
3,True


In [None]:
w = { "a": 11,"b":22,"c":33}
w_series = pd.Series(w)
w_series

Unnamed: 0,0
a,11
b,22
c,33


In [None]:
array1 = np.arange(1 , 11)
b_series = pd.Series(array1)
b_series

Unnamed: 0,0
0,1
1,2
2,3
3,4
4,5
5,6
6,7
7,8
8,9
9,10


In [None]:
#empty series
emp_s = pd.Series([])
print(emp_s)

Series([], dtype: object)


In [None]:
emp_s = pd.Series(())
print(emp_s)

Series([], dtype: object)


In [None]:
emp_s = pd.Series({})
print(emp_s)

Series([], dtype: object)


**Creating series from scalar values**

In [None]:
#Example
s = pd.Series(666)
print(s)

0    666
dtype: int64


In [None]:
print(type(s))

<class 'pandas.core.series.Series'>


 **Creating series from python dictionary**

In [None]:
#example
s_dict = pd.Series({'a':1, 'b':2, 'c':3})
print(s_dict)

a    1
b    2
c    3
dtype: int64


**Adding two series**


In [None]:
#example
series1 = pd.Series([1,2,3,4,5])
series2 = pd.Series([1,2,3,4,5])

a = series1 + series2
print(a)

0     2
1     4
2     6
3     8
4    10
dtype: int64


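**Series also provide methods such as `add()` and `divide()` that mirror the arithmetic operators. For example, `divide()` divides one Series by another, item by item:**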

In [None]:
#example
series1.divide(series2)

Unnamed: 0,0
0,1.0
1,1.0
2,1.0
3,1.0
4,1.0


### Activity 3: Finding the Number of Items
You can use the size attribute to find the number of items in a Pandas Series.



In [None]:
book_pages

Unnamed: 0,0
0,469
1,188
2,450
3,399
4,156
5,348
6,207
7,257
8,182
9,197


In [None]:
len(book_pages)

20

In [None]:
 #Find the number of items in the book_pages Pandas Series using the size attribute.
print(book_pages.size)

20


You can also use the `shape` attribute to find the dimensions of a Pandas Series. It returns a tuple, so the next few cells first review how Python tuples look.

In [None]:

a = ("a",)
type(a)

tuple

In [None]:
a = "a",
type(a)

tuple

In [None]:
a = 2
type(a)

int

In [None]:
a, b, c = 1,2,3
print(a,b,c)

1 2 3


In [None]:
#Find the number of rows and columns in the book_pages Pandas Series using the shape attribute.
print(book_pages.shape)

(20,)


In [None]:
a = book_pages.shape
type(a)

tuple

In [None]:
a[0]

20

So, there are 20 rows in the `book_pages` Pandas Series. Because a Series is one-dimensional, its shape is a one-element tuple with no separate entry for columns.
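
To see how the shape of a Series differs from the shape of a two-dimensional structure, here is a small sketch. The Series `s` and the DataFrame `df` are made up for this comparison.

```python
import pandas as pd

s = pd.Series([10, 20, 30])
df = pd.DataFrame({"x": [10, 20, 30], "y": [1, 2, 3]})

print(s.shape)          # (3,)   -> one dimension: 3 items
print(df.shape)         # (3, 2) -> two dimensions: 3 rows, 2 columns
print(s.ndim, df.ndim)  # 1 2
```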

### Activity 4: Calculating Mean, Min, and Max

You can calculate the mean, minimum, and maximum values in a Pandas Series using the `mean()`, `min()`, and `max()` functions.

In [None]:
#Calculate the average number of pages in the book_pages Series.
print(book_pages.mean())

311.75


In [None]:
#Find the minimum and maximum number of pages in the book_pages Series.
print(book_pages.min())
print(book_pages.max())

156
469



### Activity 5: Viewing Specific Data Using the `head()` and `tail()` Functions

Sometimes, instead of looking at the full dataset, we just want to look at the first few or the last few rows. In such cases, we can use the `head()` and `tail()` functions.

By default, the `head()` function shows the first five items and the `tail()` function shows the last five items in a Pandas Series.


In [None]:
book_pages

Unnamed: 0,0
0,469
1,188
2,450
3,399
4,156
5,348
6,207
7,257
8,182
9,197


In [None]:
book_pages.head(3)

Unnamed: 0,0
0,469
1,188
2,450


In [None]:
#Print the first 5 items in the book_pages Series using the head() function.
book_pages.head()

Unnamed: 0,0
0,469
1,188
2,450
3,399
4,156


In [None]:
#Print the last 5 items in the book_pages Series using the tail() function.
book_pages.tail()

Unnamed: 0,0
15,451
16,439
17,334
18,203
19,412


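You can also pass a number to `head()` and `tail()`. A positive number sets how many items to display; a negative number, covered in Activity 6, excludes that many items instead.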

In [None]:
book_pages.tail(-3)

3     399
4     156
5     348
6     207
7     257
8     182
9     197
10    308
11    267
12    164
13    393
14    411
15    451
16    439
17    334
18    203
19    412
dtype: int64

In [None]:
book_pages.head(-3)

0     469
1     188
2     450
3     399
4     156
5     348
6     207
7     257
8     182
9     197
10    308
11    267
12    164
13    393
14    411
15    451
16    439
dtype: int64


In [None]:
# Print the first 7 items in the book_pages Series using the head() function.
book_pages.head(7)

Unnamed: 0,0
0,469
1,188
2,450
3,399
4,156
5,348
6,207


### Activity 6: The `head()` & `tail()` Functions With Negative Inputs

Let `pd_series` be a Pandas series which contains `N` number of elements. Let `n` be some positive integer such that `n < N`.

The `pd_series.head(-n)` operation will return the **first** `N - n` items contained in the `pd_series`.


In [None]:
# 'head()' with negative input.
book_pages.head(-5)

0     469
1     188
2     450
3     399
4     156
5     348
6     207
7     257
8     182
9     197
10    308
11    267
12    164
13    393
14    411
dtype: int64


The `pd_series.tail(-n)` operation will return the **last** `N - n` items contained in the `pd_series`.

In [None]:
# 'tail()' with negative input.
book_pages.tail(-5)

5     348
6     207
7     257
8     182
9     197
10    308
11    267
12    164
13    393
14    411
15    451
16    439
17    334
18    203
19    412
dtype: int64
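
The two rules for negative inputs can be checked against positional slicing. This is a minimal sketch on a made-up ten-item Series; `pd_series` and `n` follow the notation used in this activity.

```python
import pandas as pd

pd_series = pd.Series(range(100, 110))   # N = 10 items
n = 3

# head(-n) keeps everything except the last n items
print(pd_series.head(-n).equals(pd_series.iloc[:-n]))   # True

# tail(-n) keeps everything except the first n items
print(pd_series.tail(-n).equals(pd_series.iloc[n:]))    # True

# Both results contain N - n items
print(len(pd_series.head(-n)), len(pd_series.tail(-n)))  # 7 7
```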


### Activity 7: Indexing a Pandas Series
Indexing a Pandas Series is similar to indexing a Python list or a NumPy array.

**Syntax:** `pandas_series[start_index:end_index:step_size]`

In [None]:
# Retrieve items with indices ranging from 5 to 10
book_pages[5:11]

Unnamed: 0,0
5,348
6,207
7,257
8,182
9,197
10,308


In [None]:
# Print every second item from index 8 up to index 15 (step size 2).
book_pages[8:16:2]


Unnamed: 0,0
8,182
10,308
12,164
14,411
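
The slicing syntax behaves the same way as list slicing: the item at `end_index` is left out, and `step_size` skips items. Here is a short sketch on a made-up Series with the default index; the names `values` and `s` are illustrative only.

```python
import pandas as pd

values = [10, 20, 30, 40, 50, 60]
s = pd.Series(values)

# start=1, end=5 (excluded), step=2 -> positions 1 and 3
print(s[1:5:2].tolist())    # [20, 40]

# The same slice on the plain Python list gives the same items
print(values[1:5:2])        # [20, 40]

# Unlike a list slice, the Series slice keeps the original index labels
print(s[1:5:2].index.tolist())   # [1, 3]
```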


### Activity 8: Using the `mode()` Function
To find the most frequently occurring values in a Pandas Series, use the `mode()` function.

In [None]:
book_pages

Unnamed: 0,0
0,469
1,188
2,450
3,399
4,156
5,348
6,207
7,257
8,182
9,197


In [None]:
a = pd.Series([1,2,2,2,3,4])
a.mode()

Unnamed: 0,0
0,2


In [None]:
a = pd.Series([1,2,2,2,3,4])
a.value_counts(normalize = True)

Unnamed: 0,proportion
2,0.5
1,0.166667
3,0.166667
4,0.166667


In [None]:
a = pd.Series([11,22,33,44,22,22,22])


In [None]:
a = pd.Series([11, 22, 33, 44, 22, 22, 22])

# Get the count of each unique value
count_22 = a.value_counts().get(22, 0)
print(count_22)

4


In [None]:
b = a.count()
print(b)

7


In [None]:
a = pd.Series([11,22,33,44,22,22,22])

w = a.value_counts()
print(w)
type(w)

22    4
11    1
33    1
44    1
Name: count, dtype: int64


In [None]:
w.size

4

In [None]:
# The index of w holds the distinct values, so look up the count of 22 by its label
print(w[22])

4

In [None]:
#Compute the modal value in the book_pages Series.
book_pages.mode()
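
`mode()` always returns a Series, because more than one value can be tied for the highest count. The sketch below uses three made-up Series to show the cases; when every value is distinct, as in `book_pages`, all values are tied and `mode()` returns all of them in sorted order.

```python
import pandas as pd

# One value is clearly the most frequent
one_mode = pd.Series([1, 2, 2, 2, 3, 4]).mode()
print(one_mode.tolist())     # [2]

# Two values are tied for the highest count
two_modes = pd.Series([5, 5, 7, 7, 9]).mode()
print(two_modes.tolist())    # [5, 7]

# Every value occurs once, so every value is a mode
all_unique = pd.Series([30, 10, 20]).mode()
print(all_unique.tolist())   # [10, 20, 30]
```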

### Activity 9: Using the `sort_values()` Function

The `sort_values()` function arranges the numbers in a Pandas Series either in ascending or descending order.

In [None]:
#Arrange the number of pages in ascending order using the sort_values() function.
book_pages.sort_values()

4     156
12    164
8     182
1     188
9     197
18    203
6     207
7     257
11    267
10    308
17    334
5     348
13    393
3     399
14    411
19    412
16    439
2     450
15    451
0     469
dtype: int64


In [None]:
#Arrange the number of pages in descending order.
book_pages.sort_values(ascending=False)

0     469
15    451
2     450
16    439
19    412
14    411
3     399
13    393
5     348
17    334
10    308
11    267
7     257
6     207
18    203
9     197
1     188
8     182
12    164
4     156
dtype: int64


### Activity 10: Using the `median()` Function

In [None]:
#Find the median number of pages in the book_pages Series.
book_pages.median()

321.0

### Activity 11: Using the `value_counts()` Function

To count the number of occurrences of each item in a Pandas Series, use the `value_counts()` function.

In [None]:
#Count the number of times each page number occurs in the book_pages Series.

book_pages.value_counts()

The `value_counts()` function in Pandas is used to count the number of occurrences of each unique value in a Series. This is particularly useful when you want to understand the distribution of values in your dataset. Here, we'll explore value_counts() with different examples to see how it works in various scenarios.



In [None]:
#Example

# Create a list of fruits
fruits = ['apple', 'banana', 'apple', 'orange', 'banana', 'apple', 'grape', 'banana']

# Convert the list to a Pandas Series
fruit_series = pd.Series(fruits)

# Use the value_counts() function
fruit_counts = fruit_series.value_counts()
print(fruit_counts)


apple     3
banana    3
orange    1
grape     1
Name: count, dtype: int64


**Note:** The `value_counts()` function is not available for Python lists and NumPy arrays.
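
Lists and arrays have no `value_counts()` method, but similar counts can be obtained with other tools. The sketch below shows two common alternatives, `collections.Counter` for a list and `np.unique(..., return_counts=True)` for an array, using the same fruit list as above. The names `list_counts` and `array_counts` are introduced only for this example.

```python
from collections import Counter

import numpy as np

fruits = ['apple', 'banana', 'apple', 'orange', 'banana', 'apple', 'grape', 'banana']

# Python list: Counter maps each distinct item to its count
list_counts = Counter(fruits)
print(list_counts['apple'])    # 3

# NumPy array: np.unique returns the sorted distinct values and their counts
values, counts = np.unique(np.array(fruits), return_counts=True)
array_counts = dict(zip(values.tolist(), counts.tolist()))
print(array_counts)   # {'apple': 3, 'banana': 3, 'grape': 1, 'orange': 1}
```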


## **Python List vs NumPy Array vs Pandas Series**

You might now be wondering when to use a Python list, a NumPy array, and a Pandas series. While there are no strict rules, here are some guidelines to help you decide:

**1. Python List:**

- Use a Python list when you want to store, retrieve, and add more data.
- Lists are flexible and can hold different data types, such as integers, strings, and objects.
- They are ideal for simple data storage and operations but may be less efficient for numerical computations.

**2. NumPy Array:**

- Use a NumPy array when you have numerical data (either one-dimensional or multidimensional) and need to perform many mathematical operations.
- NumPy arrays are faster than Python lists and are optimized for numerical computations.
- They support multi-dimensional arrays, making them suitable for scientific and engineering calculations.

**3. Pandas Series:**

- Use a Pandas series when you want to import data from an external file like TXT, XLSX, CSV, XML, etc.
- Pandas series are powerful for data analysis and manipulation.  They allow you to interpret data in different ways and perform complex data extraction, manipulation, and processing operations.
- Throughout this course, we will use the Pandas library to handle data due to its rich functionality and ease of use.
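
A quick way to feel the difference between the three is to apply the same operation to each. The following sketch multiplies a made-up list, array, and Series by 2; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

values = [1, 2, 3]

# Python list: * repeats the list instead of doing arithmetic
print(values * 2)                        # [1, 2, 3, 1, 2, 3]

# NumPy array: * multiplies every element
doubled_array = np.array(values) * 2
print(doubled_array.tolist())            # [2, 4, 6]

# Pandas Series: * multiplies every element and keeps the labels
doubled_series = pd.Series(values, index=["a", "b", "c"]) * 2
print(doubled_series.to_dict())          # {'a': 2, 'b': 4, 'c': 6}
```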

-----

# Pandas DataFrame

A DataFrame is essentially a table of data with rows and columns. Each column in a DataFrame is a Pandas Series, so we can also say that a Pandas DataFrame is a collection of Pandas Series that share the same index. In other words, a Pandas DataFrame is a two-dimensional, labeled data structure.

Data in Frame(Rows and Columns)
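
The relationship between a DataFrame and its columns can be sketched on a tiny made-up table. The names `people` and `age` are hypothetical and used only here.

```python
import pandas as pd

people = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [24, 27]})

# Selecting one column gives back a Pandas Series
age = people["Age"]
print(type(age).__name__)   # Series

# The DataFrame has two dimensions, the column has one
print(people.shape)         # (2, 2)
print(age.shape)            # (2,)
```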

To understand DataFrames better, we will use a dataset that records the heights and weights of individuals.


### Activity 1:  Creating DataFrame using different approaches
In this activity, we will learn how to create a DataFrame using different methods in Pandas. These methods include creating a DataFrame from a dictionary, a list of lists, and from an existing CSV file.

**Creating Empty DataFrame**

In [None]:
#Example
e_df = pd.DataFrame(   )
print(e_df)

**1. Creating a DataFrame from a Dictionary**

One common way to create a DataFrame is by using a dictionary. Each key in the dictionary represents a column name, and the corresponding value is a list of values for that column.

In [None]:
# Creating a dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [24, 27, 22, 32],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']
}

# Creating a DataFrame from the dictionary
df_dict = pd.DataFrame(data)
print(df_dict)

      Name  Age         City
0    Alice   24     New York
1      Bob   27  Los Angeles
2  Charlie   22      Chicago
3    David   32      Houston


**2. Creating a DataFrame from a List of Lists**

Another way to create a DataFrame is by using a list of lists. In this approach, each inner list represents a row of data, and we can specify the column names separately.

In [None]:

# Creating a list of lists
data = [
    ['Alice', 24, 'New York'],
    ['Bob', 27, 'Los Angeles'],
    ['Charlie', 22, 'Chicago'],
    ['David', 32, 'Houston']
]

# Creating a DataFrame from the list of lists
df_list = pd.DataFrame(data, columns=['Name', 'Age', 'City'])
print(df_list)

      Name  Age         City
0    Alice   24     New York
1      Bob   27  Los Angeles
2  Charlie   22      Chicago
3    David   32      Houston


In [None]:
# Creating a list of lists
data = [
    ['Alice', 24, 'New York'],
    ['Bob', 27, 'Los Angeles'],
    ['Charlie', 22, 'Chicago'],
    ['David', 32, 'Houston']
]

# Creating a DataFrame from the list of lists
df_list = pd.DataFrame(data)
print(df_list)

**3. Creating a DataFrame from an Existing CSV File**

One of the most common ways to create a DataFrame is by reading data from a CSV file using the `read_csv()` function.

In [None]:
# create dataframe from csv file
df = pd.read_csv("/content/hw_200.csv")
df

Unnamed: 0,Index,"Height(Inches)""","""Weight(Pounds)"""
0,1,65.78,112.99
1,2,71.52,136.49
2,3,69.40,153.03
3,4,68.22,142.34
4,5,67.79,144.30
...,...,...,...
195,196,65.80,120.84
196,197,66.11,115.78
197,198,68.24,128.30
198,199,68.02,127.47
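
If the CSV file is not at hand, `read_csv()` can be tried on in-memory text instead, since it also accepts file-like objects. The sketch below is only an illustration: the text in `csv_text` copies the first three rows shown above, and the simplified column names `Height` and `Weight` are assumptions, not the headers of the real file.

```python
import io

import pandas as pd

# A few rows of CSV text standing in for a file on disk (hypothetical headers)
csv_text = (
    "Index,Height,Weight\n"
    "1,65.78,112.99\n"
    "2,71.52,136.49\n"
    "3,69.40,153.03\n"
)

# io.StringIO wraps the text so read_csv can treat it like an open file
df_small = pd.read_csv(io.StringIO(csv_text))

print(df_small.shape)            # (3, 3)
print(list(df_small.columns))    # ['Index', 'Height', 'Weight']
```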


### Activity 2: Checking for Missing Values

Use the `isnull()` function to check for missing values in the DataFrame. Display the total number of missing values in each column.

In [None]:
#activity
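
One possible way to approach this activity is sketched below on a small made-up DataFrame rather than the height and weight file. `isnull()` marks each missing cell with `True`, and chaining `sum()` adds up those marks per column. The names `df_missing` and `missing_per_column` are hypothetical.

```python
import numpy as np
import pandas as pd

# A made-up table with one missing height and one missing weight
df_missing = pd.DataFrame({
    "Index": [1, 2, 3],
    "Height": [65.78, np.nan, 69.40],
    "Weight": [112.99, 136.49, np.nan],
})

# isnull() gives True/False per cell; sum() counts the True values in each column
missing_per_column = df_missing.isnull().sum()
print(missing_per_column)
```

The same two calls can be applied to any DataFrame, including one loaded with `read_csv()`.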


### Activity 3: Slicing a DataFrame Using the `iloc[]` Function

Extract specific rows and columns from the DataFrame using `iloc[]`.

The **iloc** indexer in Pandas is a powerful tool used to select rows and columns from a DataFrame by their integer position (i.e., their index position). It stands for "integer location" and allows for both row and column indexing.

**Syntax:**

`dataframe_name.iloc[row_position_start : row_position_end, column_position_start : column_position_end]`

In this syntax:

- `row_position_start` is the position of the first row to include in the new Pandas Series or DataFrame.
- `row_position_end` is the position at which the row selection stops. The row at this position is **not** included.
- `column_position_start` is the position of the first column to include in the new Pandas Series or DataFrame.
- `column_position_end` is the position at which the column selection stops. The column at this position is **not** included.

Positions start at 0, as in Python list slicing.

In [None]:
# Create a simple DataFrame
data = {'A': [1, 2, 3, 4],
        'B': [5, 6, 7, 8],
        'C': [9, 10, 11, 12]}
df = pd.DataFrame(data)

print(df)

In [None]:
w=df.iloc[0]


In [None]:
df.iloc[0:3]

In [None]:
q=df.iloc[0:3,0:2]


In [None]:
df.iloc[:,0:2]

In [None]:
df.iloc[[0,2,3],[1,2]]
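
The `iloc[]` calls above can be gathered into one runnable sketch that also prints what each selection contains. It rebuilds the same small DataFrame; the names `first_row`, `block`, and `picked` are introduced only for this example.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': [5, 6, 7, 8],
                   'C': [9, 10, 11, 12]})

# A single position returns that row as a Series
first_row = df.iloc[0]
print(first_row.tolist())        # [1, 5, 9]

# Rows 0-2 and columns 0-1; the end positions 3 and 2 are excluded
block = df.iloc[0:3, 0:2]
print(block.values.tolist())     # [[1, 5], [2, 6], [3, 7]]

# Lists of positions pick specific rows and columns
picked = df.iloc[[0, 2, 3], [1, 2]]
print(picked.values.tolist())    # [[5, 9], [7, 11], [8, 12]]
```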