**Introduction to Pandas**

- Why use pandas when we can already analyze data in Excel?

- Analyzing data in Excel means examining and interpreting it with tools and functions to find patterns, trends, and insights.

- Example 1: Analyzing a company's profits and losses over time helps identify trends, such as whether profits are increasing or losses are decreasing. This analysis can guide decisions on how to improve financial performance.

- Example 2: Analyzing the number of students enrolled in a course over time helps identify trends, such as whether enrollment is increasing or decreasing. This analysis can guide decisions on how to improve course offerings.

- Example 3: Analyzing the number of customers who purchased a product over time helps identify trends, such as whether sales are increasing or decreasing. This analysis can guide decisions on how to improve marketing strategies.

**What is the problem with Excel, then, if we can analyze data in Excel too?**
  1. *Excel is powerful for many tasks, but its programmatic capabilities are limited compared to Python.*
  *VBA (Visual Basic for Applications) is available. It is a programming language developed by Microsoft for automation and scripting within Microsoft Office applications such as Excel, Word, and Access.*
  *It is generally considered more complex and less versatile than Python and pandas for data manipulation.*

  2. *Excel is not designed for large-scale data analysis.*
     *It can become slow and unresponsive when dealing with large datasets.*
     *A worksheet is also capped at 1,048,576 rows, so very large datasets cannot be loaded at all.*




**Benefits of Using Pandas**
1. *Pandas is a powerful and flexible library for data manipulation and analysis in Python*

2. *Pandas can read columns into a DataFrame. For example, suppose an Excel sheet holds the following rows and columns:*

    | Name      | Subject | T.Marks | Obt.Marks |
    |-----------|---------|---------|-----------|
    | Alice     | English | 100     | 80        |
    | Merry     | English | 100     | 88        |
    | Katherine | English | 100     | 88        |

*Pandas reads this data into RAM, and the analysis then runs on that in-memory copy. Accessing data in RAM is faster than accessing it on disk, which is why the computation speeds up.*
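
A minimal sketch of this workflow, assuming the marks table above has been typed into a dictionary instead of being loaded from a file (in practice `pd.read_excel` or `pd.read_csv` would typically do the loading). The variable name `marks` and the `Percentage` column are illustrative and not part of the original data.

```python
import pandas as pd

# Hypothetical in-memory copy of the marks table shown above
marks = pd.DataFrame({
    "Name": ["Alice", "Merry", "Katherine"],
    "Subject": ["English", "English", "English"],
    "T.Marks": [100, 100, 100],
    "Obt.Marks": [80, 88, 88],
})

# Arithmetic on columns is applied to every row at once
marks["Percentage"] = marks["Obt.Marks"] * 100 / marks["T.Marks"]

# Summary statistics are one method call away
average_obtained = marks["Obt.Marks"].mean()

print(marks)
print(average_obtained)
```

Once the table is a DataFrame, every later step (new columns, averages, filters) works on the whole column without writing an explicit loop.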

*Pandas provides data structures and functions to efficiently handle structured data, including tabular data such as spreadsheets and SQL tables, and time series data.*

*Pandas is highly extensible and can be used in conjunction with other popular data science libraries in Python, such as NumPy, Matplotlib, and Scikit-learn.*

*Pandas is scalable: it is designed for large-scale data analysis and handles far larger datasets than Excel, as long as the data fits in memory.*


**"Scalable" here means that pandas handles large datasets more efficiently than Excel. It offers better performance and more control over memory use, which makes it suitable for large and complex analyses that would be cumbersome in Excel. For extremely large datasets, however, even pandas may need additional tools or approaches, such as Dask or a database.**

# Pandas Data Structures
> Pandas has two primary data structures.

**Series:** A one-dimensional labeled array that can hold any data type. It represents a single column or row of data.
          --> A Pandas Series is like a column in a table

**DataFrame:** A two-dimensional labeled array with columns that can hold different data types. It represents a tabular structure with rows and columns, similar to a spreadsheet.


**Labeled Array**
*A labeled array is an array where each element is associated with a label (its index), which can be customized (e.g., strings, dates) instead of being limited to integers starting from 0. This allows more flexible and meaningful data access than NumPy arrays, which are addressed only by integer position.*
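
The difference can be sketched as follows. The values and the day labels are made up for illustration; the point is only that a NumPy array is addressed by position while a Series can be addressed by a label of our choosing.

```python
import numpy as np
import pandas as pd

values = [22, 25, 23]

# NumPy array: elements are reachable only by integer position
arr = np.array(values)
by_position = arr[1]

# pandas Series: the same values, but each one carries a custom label
s = pd.Series(values, index=["Mon", "Tue", "Wed"])
by_label = s["Tue"]

print(by_position, by_label)
```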

In [8]:
# Creating a simple pandas Series from a list
import pandas as pd
L = [1, 8, 3, 4, 5]
sr = pd.Series(L)
print(f"{sr}\n")
# If no index is specified, the values are labeled 0, 1, 2, ... in order.
# These labels can be used to access a specific value.
print(sr[1])

# We can create our own labels using the index argument

L2 = [100, 200, 300, 400, 500]
s = pd.Series(L2, index=['hundred', '2-Hundred', '3-Hundred', '4-Hundred', '5-Hundred'])
print(f"\n{s}\n")
# Accessing a value using its label
print(s['5-Hundred'])


0    1
1    8
2    3
3    4
4    5
dtype: int64

8

hundred      100
2-Hundred    200
3-Hundred    300
4-Hundred    400
5-Hundred    500
dtype: int64

500


**Using Key-Value Object (Dictionary) when creating a series**

In [9]:
import pandas as pd
temperature=pd.Series({
    "Monday": 22, 
    "Tuesday": 25,
    "Wednesday": 23, 
    "Thursday": 45, 
    "Friday": 26
})
print(temperature)

# Single quotes inside the f-string keep this line valid on Python versions before 3.12
print(f"\n {temperature['Thursday']}")


# In a Series, the keys of the dictionary become the index labels, so each value is associated with a label.
# In contrast, in a DataFrame, the keys become the column names.


Monday       22
Tuesday      25
Wednesday    23
Thursday     45
Friday       26
dtype: int64

 45
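
Because the labels travel with the values, results of a calculation can be reported by label instead of by position. The following is a small sketch that reuses the temperature data above; the names `hottest_day` and `warm_days` and the threshold of 24 are illustrative choices.

```python
import pandas as pd

temperature = pd.Series({
    "Monday": 22,
    "Tuesday": 25,
    "Wednesday": 23,
    "Thursday": 45,
    "Friday": 26,
})

# idxmax returns the label of the largest value, not its position
hottest_day = temperature.idxmax()

# A condition keeps only the matching values, with their labels attached
warm_days = temperature[temperature > 24]

print(hottest_day)
print(warm_days)
```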


# Creating DataFrames
*Data sets in Pandas are usually multi-dimensional tables, called DataFrames.*
> *Pandas DataFrame is a 2 dimensional data structure, like a 2 dimensional array, or a table with rows and columns.*

> *A Series is like a column; a DataFrame is the whole table.*

**In a dictionary used to create a pandas DataFrame, the keys represent the column names.**
**The values associated with these keys are the data that populate the columns of the DataFrame.**

In [10]:
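# Creating a DataFrame from a dictionary of scalar values
import pandas as pd # type: ignore
product = {
   "Prices": 1000,
   "Sales": 100
}

# Loading the data into a DataFrame. Scalar values need an explicit index;
# each scalar is repeated for every index label (here rows 0 and 1).
Df = pd.DataFrame(product, index=[0, 1])
print(Df)
print("\n")
# The attribute for checking the column dtypes of a DataFrame is "dtypes", not "dtype"
print(Df.dtypes)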

   Prices  Sales
0    1000    100
1    1000    100


Prices    int64
Sales     int64
dtype: object
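
A DataFrame can also be assembled from Series objects, which makes the "Series is a column" idea concrete. This is a sketch with made-up values and labels; `prices`, `sales`, and `frame` are illustrative names.

```python
import pandas as pd

# Two Series that share the same index labels
prices = pd.Series([1000, 2000], index=["A", "B"])
sales = pd.Series([100, 200], index=["A", "B"])

# Dictionary keys become column names; the Series become the columns
frame = pd.DataFrame({"Prices": prices, "Sales": sales})

# Selecting one column gives back a Series
column = frame["Prices"]

print(frame)
print(type(column).__name__)
```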


### Locate Rows
> The loc attribute returns one or more specified rows.

> .loc uses labels (index names) to select data.

> .iloc uses integer positions (index positions) to select data.

**.loc:**
> **Access Rows:** Primarily used to access rows and columns by labels.

> **Boolean Indexing:** Can be used with boolean conditions to filter rows.

> **Slicing:** Supports slicing based on labels. For example, Df.loc['A':'C'] selects rows from label 'A' through 'C', with both endpoints included.

### Boolean indexing
> It is a method to filter data based on conditions that evaluate to True or False.

> **Integer-Based:** Filtering based on integer values.
> Example: df[df['Age'] > 30]

> **String-Based:** Filtering based on string values or patterns.
> Example: df[df['Name'].str.startswith('A')]

> **Mixed Data Types:** Combining multiple conditions involving different data types.
> Example: df[(df['Age'] > 30) & (df['Name'].str.contains('a'))]
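
The three filters above can be tried on a small table. The `people` DataFrame and its values are hypothetical, chosen only so that each filter keeps a different set of rows. Note that `str.contains('a')` is case-sensitive, so a capital "A" does not match.

```python
import pandas as pd

# Hypothetical data for illustration
people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Anita", "Carlos"],
    "Age": [34, 28, 41, 52],
})

# Integer-based condition
older = people[people["Age"] > 30]

# String-based condition
starts_with_a = people[people["Name"].str.startswith("A")]

# Combined condition: each part is wrapped in parentheses and joined with &
both = people[(people["Age"] > 30) & (people["Name"].str.contains("a"))]

print(older)
print(starts_with_a)
print(both)
```

Each condition produces a Series of True/False values, one per row, and only the rows marked True are kept.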


**.iloc:**
> **Access Rows and Columns:** Used to access rows and columns by integer positions.

> **Slicing:** Supports slicing based on integer positions.
> For example, Df.iloc[0:3] selects rows from position 0 to 2; the end position is excluded.
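
The two slicing rules are easy to mix up, so here is a short side-by-side sketch. The DataFrame and its labels 'A' to 'E' are assumptions made for the example.

```python
import pandas as pd

df = pd.DataFrame(
    {"Sales": [100, 200, 400, 666, 376]},
    index=["A", "B", "C", "D", "E"],
)

# Label slice: the end label 'C' is included
by_label = df.loc["A":"C"]

# Position slice: the end position 3 is excluded
by_position = df.iloc[0:3]

# Both selections contain rows A, B and C
same_rows = by_label.equals(by_position)

print(by_label)
print(by_position)
print(same_rows)
```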



In [11]:
import pandas as pd
product2={
   "Prices":[1000,200000,400000,666000,376666],
   "Sales": [100,200,400,666,376]
}

# Loading the data into a DataFrame
Df = pd.DataFrame(product2)
# A single label returns a pandas Series
print(Df.loc[3])

print("\n")
# A list of labels returns a DataFrame
print(Df.loc[[0, 2, 3]])


# In a DataFrame, Df[0] raises a KeyError here: square brackets select columns by label,
# and this DataFrame has no column named 0. Use .loc or .iloc to access rows.
# In a Series, sr[0] works when the index holds integer labels (the default), because 0 is then a label.


Prices    666000
Sales        666
Name: 3, dtype: int64


   Prices  Sales
0    1000    100
2  400000    400
3  666000    666
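
The note about square brackets on a DataFrame can be checked directly. This sketch uses a shortened version of the table above; `error_name` is only a helper variable for the demonstration.

```python
import pandas as pd

Df = pd.DataFrame({"Prices": [1000, 200000], "Sales": [100, 200]})

# Square brackets look up a column by its label
prices = Df["Prices"]

# Rows are reached through .iloc (by position) or .loc (by label)
first_row = Df.iloc[0]

# There is no column labeled 0, so this lookup fails
error_name = None
try:
    Df[0]
except KeyError as err:
    error_name = type(err).__name__
    print("KeyError:", err)
```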


> .loc can use boolean indexing.

>  Boolean indexing allows us to filter rows based on conditions

In [12]:
import pandas as pd

# Create a dictionary
product2 = {
    "Prices": [1000, 200000, 400000, 666000, 376666],
    "Sales": [100, 200, 400, 666, 376]
}

# Load the data into a DataFrame with custom index labels
Df = pd.DataFrame(product2, index=['A', 'B', 'C', 'D', 'E'])

# Boolean indexing example: Select rows where the Prices are greater than 100000
print("Using .loc with boolean indexing:")
print(Df.loc[Df['Prices'] > 100000])


Using .loc with boolean indexing:
   Prices  Sales
B  200000    200
C  400000    400
D  666000    666
E  376666    376


In [13]:
import pandas as pd  # type: ignore

#Scenario: You have a DataFrame of employee records, and you want to access data based on specific labels (like employee IDs or names).
data = {
    "EmployeeID": [101, 102, 103, 104],
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Department": ["HR", "Finance", "IT", "Marketing"]
}
df = pd.DataFrame(data)

# Accessing data by label using .loc (requires EmployeeID to be set as the index)
df = df.set_index("EmployeeID")  # Make EmployeeID the index

print(df.loc[101:105])  # Label slice: fetches every employee with an ID from 101 through 105


               Name Department
EmployeeID                    
101           Alice         HR
102             Bob    Finance
103         Charlie         IT
104           David  Marketing
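
To fetch a single employee instead of a range, `.loc` also accepts one label, a label plus a column name, or lists of both. A sketch using the same employee records; the variable names `charlie`, `department`, and `subset` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "EmployeeID": [101, 102, 103, 104],
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Department": ["HR", "Finance", "IT", "Marketing"],
}).set_index("EmployeeID")

# One label: the whole row comes back as a Series
charlie = df.loc[103]

# Row label and column label: a single value
department = df.loc[103, "Department"]

# Lists of labels: a smaller DataFrame
subset = df.loc[[101, 104], ["Name"]]

print(charlie)
print(department)
print(subset)
```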


In [14]:
import pandas as pd
# Scenario: You have a DataFrame of daily sales data and you want to access data by integer position (e.g., the first 3 days of data).
# .iloc is useful when you need rows and columns by their zero-based integer positions,
# for example when slicing data or selecting specific rows and columns.
# Creating a DataFrame of daily sales
data = {
    "Date": ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04", "2024-01-05"],
    "Sales": [500, 600, 700, 800, 900]
}
df = pd.DataFrame(data)

#Accessing rows by integer position using .iloc
print(df.iloc[0:3])  # Fetches data for the first 3 rows


         Date  Sales
0  2024-01-01    500
1  2024-01-02    600
2  2024-01-03    700
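
`.iloc` also takes a column position and negative positions. The sketch below reuses the daily sales data; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04", "2024-01-05"],
    "Sales": [500, 600, 700, 800, 900],
})

# Negative positions count from the end: -1 is the last row
last_row = df.iloc[-1]

# Row position 0, column position 1 (the Sales column)
first_sale = df.iloc[0, 1]

# Rows at positions 1, 2 and 3 of the Sales column
middle_sales = df.iloc[1:4, 1]

print(last_row)
print(first_sale)
print(middle_sales)
```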
