**Introduction to Pandas**
- What is Pandas?
Pandas is a Python library for working with tabular and labeled data. You can use it to:

1. Read and write data from files.
2. Clean and organize data.
3. Analyze and visualize data.

----

- Installing Pandas
To install Pandas in Python, follow these simple steps:

1. **Open your command prompt or terminal.**
2. **Type the following command:**
   ```sh
   pip install pandas
   ```
3. **Press Enter.**

- To install Pandas in a Jupyter Notebook, follow these steps:

1. **Open your Jupyter Notebook.**
2. **Create a new cell and type the following command:**
   ```python
   !pip install pandas
   ```
3. **Run the cell by pressing Shift + Enter.**

This will download and install Pandas on your computer.

----

- Importing Pandas
To import Pandas in Python, including in a Jupyter Notebook, follow these steps:

1. **Open your Jupyter Notebook.**
2. **Create a new cell and type the following command:**
   ```python
   import pandas as pd
   ```
3. **Run the cell by pressing Shift + Enter.**

----

### Let's Start Learning Pandas.

In [6]:
#import pandas
import pandas as pd

### Data Structures in Pandas
In Pandas, there are two main types of data structures:

- **Series**: Think of it as a single column of data, like a list with labels for each item.
  
- **DataFrame**: This is like a table with rows and columns, where each column can hold different types of data.

These structures help Pandas organize and work with data effectively, making tasks like data analysis and manipulation easier.

- **Creating DataFrames and Series**: 
  - You can create Series and DataFrames from lists, dictionaries, arrays, or even from reading data files like CSVs.
  This allows you to easily manipulate and analyze structured data.

## Series

A **Series** in Pandas is like a list with labels for each item. Here's what you need to know:

- **Definition**: It's a one-dimensional labeled array that can hold data of any type (integers, floats, strings, or Python objects).
  
- **Indexing**: Each item in a Series has a label, starting from 0 by default.

- **Creation**: You can create a Series from lists, arrays, or dictionaries.

- **Operations**: You can perform operations like adding, slicing, and sorting data in a Series.

**Example:**

In [13]:
# Creating a Series from a list
data = [10, 20, 30, 40, 50]
s = pd.Series(data)
print(s)


0    10
1    20
2    30
3    40
4    50
dtype: int64
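
The operations listed above (arithmetic, slicing, and sorting) can be sketched on the same values as follows. This is a minimal illustration, and the variable names `doubled`, `first_three`, and `descending` are chosen here for the example only.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50])

# Arithmetic is applied element by element
doubled = s * 2

# Positional slicing with iloc: the end position is excluded
first_three = s.iloc[:3]

# Sorting returns a new Series; each value keeps its original index label
descending = s.sort_values(ascending=False)

print(doubled.tolist())           # [20, 40, 60, 80, 100]
print(first_three.tolist())       # [10, 20, 30]
print(descending.index.tolist())  # [4, 3, 2, 1, 0]
```

Note that `sort_values()` leaves `s` itself unchanged.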


In Pandas Series, **names** and **labels** refer to different aspects of how data is organized and accessed:

### Names and Labels in Series:

1. **Names (`name` attribute)**:
   - The `name` attribute in a Series is used to give a name to the Series itself, not its individual elements.
   - It helps to provide a descriptive name to the Series, especially when it's part of a larger DataFrame.

In [16]:
# Creating a Series with a name
s = pd.Series([1, 2, 3, 4, 5], name='MySeries')
print(s)
# Accessing the name attribute
print(s.name)

0    1
1    2
2    3
3    4
4    5
Name: MySeries, dtype: int64
MySeries


2. **Labels (Index labels)**:
   - Labels in a Series refer to the index labels or keys assigned to each element in the Series.
   - They allow you to access and manipulate specific data points using these labels.

   Example:

In [17]:
# Creating a Series with custom index labels
data = [10, 20, 30, 40, 50]
index = ['A', 'B', 'C', 'D', 'E']
s = pd.Series(data, index=index)
print(s)

A    10
B    20
C    30
D    40
E    50
dtype: int64



### Practical Uses:

- **Names**: Provide clarity and context to the Series, especially useful in multi-column DataFrames.
- **Labels**: Facilitate easy and efficient data access and manipulation using descriptive labels rather than numeric indices.
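
To make the two ideas concrete, here is a small sketch that uses both at once. The Series name `price` and the labels `A`, `B`, `C` are made up for illustration.

```python
import pandas as pd

prices = pd.Series([10, 20, 30], index=['A', 'B', 'C'], name='price')

# Access one element by its index label
b_value = prices['B']

# Label-based slices include both endpoints
a_to_b = prices.loc['A':'B']

# The Series name becomes the column name when converted to a DataFrame
frame = prices.to_frame()

print(b_value)              # 20
print(a_to_b.tolist())      # [10, 20]
print(list(frame.columns))  # ['price']
```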

----

## DataFrame

### DataFrame in Pandas Explained:

- **Definition**: A DataFrame is a two-dimensional labeled data structure with columns of potentially different types. It can be thought of as a table in a relational database or an Excel spreadsheet, where each column represents a different variable, and each row represents a different observation or entry.

- **Features**:
  - **Columns**: Each column in a DataFrame is a Pandas Series.
  - **Index**: DataFrame has both a row index and column labels, allowing for efficient data alignment and manipulation.
  - **Flexibility**: Columns can contain different types of data (integer, float, string, etc.).

### Creating DataFrames:
1. **From Dictionaries:**

In [20]:
# Creating a DataFrame from a dictionary
data = {'Name': ['John', 'Emma', 'Peter'], 'Age': [28, 24, 32]}
df = pd.DataFrame(data)
print(df)

    Name  Age
0   John   28
1   Emma   24
2  Peter   32


2. **From Lists of Lists:**

In [21]:
# Creating a DataFrame from a list of lists
data = [['John', 28], ['Emma', 24], ['Peter', 32]]
df = pd.DataFrame(data, columns=['Name', 'Age'])
print(df)

    Name  Age
0   John   28
1   Emma   24
2  Peter   32


3. **From CSV Files:**

- Reading a DataFrame from a CSV file: `df = pd.read_csv('data.csv')`


### Operations on DataFrames:

- **Accessing Elements**: Use `loc[]` to access elements by label and `iloc[]` to access them by integer position.
  
- **Slicing and Filtering**: Extract subsets of data based on conditions or positional indices.

- **Adding and Removing Columns**: Easily add or drop columns to manipulate data structure as needed.
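
The three kinds of operations above might look like this in practice. This is a short sketch on the same sample data; the column name `AgeNextYear` is hypothetical and only serves the example.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Emma', 'Peter'], 'Age': [28, 24, 32]})

# Accessing elements: by label, then by integer position
by_label = df.loc[0, 'Name']
by_position = df.iloc[2, 1]

# Filtering: keep only the rows where the condition is True
over_25 = df[df['Age'] > 25]

# Adding a column computed from an existing one, then dropping it again
df['AgeNextYear'] = df['Age'] + 1
trimmed = df.drop(columns=['AgeNextYear'])

print(by_label)                  # John
print(by_position)               # 32
print(over_25['Name'].tolist())  # ['John', 'Peter']
print(list(trimmed.columns))     # ['Name', 'Age']
```

`drop()` returns a new DataFrame, so `df` still contains `AgeNextYear` afterwards.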

### Practical Uses:

- **Data Analysis**: Perform statistical analysis, data visualization, and data manipulation tasks.
  
- **Data Cleaning**: Handle missing values, rename columns, and convert data types.

- **Data Integration**: Merge, join, or concatenate multiple DataFrames to integrate and analyze data from various sources.
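
As a rough sketch of the cleaning and integration tasks mentioned above, the example below fills a missing value, converts a data type, renames columns, and merges two DataFrames. The tables `people` and `cities` are invented for illustration, and filling with the mean is just one possible strategy.

```python
import pandas as pd

people = pd.DataFrame({'name': ['John', 'Emma', 'Peter'], 'age': [28, None, 32]})
cities = pd.DataFrame({'Name': ['John', 'Emma'], 'City': ['New York', 'San Francisco']})

cleaned = people.copy()

# Handle missing values: replace the missing age with the column mean (30.0)
cleaned['age'] = cleaned['age'].fillna(cleaned['age'].mean())

# Convert data types: the column is float because of the missing value
cleaned['age'] = cleaned['age'].astype(int)

# Rename columns
cleaned = cleaned.rename(columns={'name': 'Name', 'age': 'Age'})

# Integrate data: a left merge keeps every row of `cleaned`
merged = cleaned.merge(cities, on='Name', how='left')

print(cleaned['Age'].tolist())  # [28, 30, 32]
print(merged.shape)             # (3, 3)
```

Peter has no match in `cities`, so his `City` value is missing in `merged`.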

----

**Data Importing and Exporting**
- Reading data from CSV, Excel, SQL, and other file types

### Reading Data from Different File Types:

#### 1. **CSV File**:
```python
import pandas as pd

# Reading from a CSV file
df_csv = pd.read_csv('data.csv')
```
---

#### 2. **Excel File**:
```python
# Reading from an Excel file
df_excel = pd.read_excel('data.xlsx', sheet_name='Sheet1')

# Displaying the first few rows of the DataFrame
print(df_excel.head())
```
---

#### 3. **SQL Database**:
```python
from sqlalchemy import create_engine

# Establishing a connection to the SQL database
engine = create_engine('sqlite:///data.db')  # Replace 'sqlite:///data.db' with your database URL

# Reading data from a SQL query into a DataFrame
query = 'SELECT * FROM table_name;'
df_sql = pd.read_sql(query, con=engine)
```
---

#### 4. **Other Formats (JSON, HTML, etc.)**:
```python
# Reading from JSON file
df_json = pd.read_json('data.json')

# Reading from HTML (web scraping)
df_html = pd.read_html('https://example.com/data.html')[0]  # Assuming the table is the first one on the page
```

### Key Points:

- **CSV and Excel**: Use `pd.read_csv()` and `pd.read_excel()` respectively. You can specify additional parameters like `sheet_name` for Excel.
- **SQL Database**: Connect using SQLAlchemy (`create_engine`) and then use `pd.read_sql()` to execute SQL queries and read data into a DataFrame.
- **Other Formats**: Pandas supports various other formats such as JSON (`pd.read_json()`), HTML (`pd.read_html()` for web scraping), and more.

### Notes:

- Ensure you have the necessary libraries installed: `pandas` itself, `sqlalchemy` for SQL databases, `openpyxl` for `.xlsx` files, and an HTML parser such as `lxml` for `pd.read_html()`.
- Adjust parameters such as file paths, database URLs, and query strings (`query`) as per your specific data source and requirements.

This structured approach helps in understanding how to import data from different file types and sources into Pandas DataFrames effectively.
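
If you do not have data files at hand, the reading functions can still be tried on in-memory data. The sketch below makes two assumptions that are not part of the examples above: it feeds `pd.read_csv()` a `StringIO` object instead of a file path, and it uses Python's built-in `sqlite3` module with a throwaway in-memory database instead of SQLAlchemy. The table name `people` is made up.

```python
import io
import sqlite3

import pandas as pd

# CSV: read_csv accepts any file-like object, not only a path
csv_text = "Name,Age\nJohn,28\nEmma,24\n"
df_csv = pd.read_csv(io.StringIO(csv_text))

# SQL: build a small in-memory SQLite table, then query it
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE people (Name TEXT, Age INTEGER)')
conn.executemany('INSERT INTO people VALUES (?, ?)', [('John', 28), ('Emma', 24)])
df_sql = pd.read_sql('SELECT * FROM people WHERE Age > 25', con=conn)
conn.close()

print(df_csv.shape)             # (2, 2)
print(df_sql['Name'].tolist())  # ['John']
```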

--------------------------

# Writing data to CSV, Excel, SQL, and other file types

Writing data from a Pandas DataFrame means exporting, or saving, it to a file or database. Each destination has its own `to_*` method on the DataFrame: `df.to_csv()` for CSV, `df.to_excel()` for Excel, `df.to_sql()` for SQL databases, and so on. Pick the method that matches the format you need.

### Writing Data to Different File Types:

#### 1. **CSV File**:
```python
import pandas as pd

# Creating a sample DataFrame
data = {
    'Name': ['John', 'Emma', 'Peter'],
    'Age': [28, 24, 32],
    'City': ['New York', 'San Francisco', 'Chicago']
}
df = pd.DataFrame(data)

# Writing to a CSV file
df.to_csv('data.csv', index=False)
```
--------------

#### 2. **Excel File**:
```python
# Writing to an Excel file
df.to_excel('data.xlsx', sheet_name='Sheet1', index=False)
```
--------------

#### 3. **SQL Database**:
```python
from sqlalchemy import create_engine

# Establishing a connection to the SQL database
engine = create_engine('sqlite:///data.db')  # Replace 'sqlite:///data.db' with your database URL

# Writing data to a SQL table
df.to_sql('table_name', con=engine, index=False, if_exists='replace')
```
--------------

#### 4. **Other Formats (JSON, HTML, etc.)**:
```python
# Writing to JSON file
df.to_json('data.json', orient='records')

# Writing to HTML (convert DataFrame to HTML table)
df.to_html('data.html', index=False)
```

--------------

### Key Points:

- **CSV and Excel**: Use `df.to_csv()` and `df.to_excel()` respectively. Specify `index=False` to avoid writing row indices to the file.
- **SQL Database**: Connect using SQLAlchemy (`create_engine`) and then use `df.to_sql()` to write DataFrame contents to a SQL table. Specify `if_exists='replace'` to overwrite the table if it already exists.
- **Other Formats**: Pandas supports various other formats such as JSON (`df.to_json()`) and HTML (`df.to_html()`).

### Notes:

- Ensure you have the necessary libraries installed: `pandas` itself, `sqlalchemy` for SQL databases, and `openpyxl` for writing `.xlsx` files.
- Adjust parameters such as file paths, database URLs, and table names (`table_name`) as per your specific data destination and requirements.

By leveraging these Pandas functions, you can effectively export data from Pandas DataFrames to different file types and storage solutions for further use and sharing.
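
One way to check an export is a round trip: write the data out, read it back, and compare. The sketch below does this without touching the disk. It assumes an in-memory SQLite database from Python's built-in `sqlite3` module in place of SQLAlchemy, and the table name `people` is hypothetical.

```python
import io
import sqlite3

import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Emma'], 'Age': [28, 24]})

# With no path argument, to_csv() returns the CSV content as a string
csv_text = df.to_csv(index=False)
restored = pd.read_csv(io.StringIO(csv_text))

# Likewise, to_json() returns a string when no path is given
json_text = df.to_json(orient='records')

# Write to a SQL table, then count the stored rows
conn = sqlite3.connect(':memory:')
df.to_sql('people', con=conn, index=False, if_exists='replace')
row_count = conn.execute('SELECT COUNT(*) FROM people').fetchone()[0]
conn.close()

print(restored.equals(df))  # True
print(row_count)            # 2
```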

To view and understand data within a Pandas DataFrame, you can use several methods, including `head()`, `tail()`, `info()`, and `describe()`.

With `head()` and `tail()` you can specify how many rows to view:

- **head(n)**: Displays the first `n` rows of the DataFrame.
- **tail(n)**: Displays the last `n` rows of the DataFrame.

### Explanation:
- **head(3)**: Calling `df.head(3)` displays the first 3 rows of the DataFrame `df`.
- **tail(2)**: Calling `df.tail(2)` displays the last 2 rows of the DataFrame `df`.

These methods allow you to specify the number of rows you want to inspect from the beginning or end of your DataFrame, providing flexibility in data exploration.
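
A quick sketch of both calls on a five-row sample DataFrame (the data is illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Emma', 'Peter', 'Alice', 'Bob'],
    'Age': [28, 24, 32, 30, 35],
})

first_three = df.head(3)  # rows at positions 0, 1, 2
last_two = df.tail(2)     # rows at positions 3, 4

print(first_three['Name'].tolist())  # ['John', 'Emma', 'Peter']
print(last_two['Name'].tolist())     # ['Alice', 'Bob']
```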

### Viewing Data in Pandas:

#### 1. **head()**:
- **Usage**: Displays the first 5 rows of the DataFrame by default; pass `n` to change the count.
- **Example**: `df.head()`
- **Description**: Useful for quickly inspecting the structure and content at the beginning of the DataFrame.

#### 2. **tail()**:
- **Usage**: Shows the last 5 rows of the DataFrame by default; pass `n` to change the count.
- **Example**: `df.tail()`
- **Description**: Helps in checking the structure and content at the end of the DataFrame.

#### 3. **info()**:
- **Usage**: Provides a concise summary of the DataFrame including column names, data types, and memory usage.
- **Example**: `df.info()`
- **Description**: Useful for understanding the overall structure and data types present in the DataFrame.

#### 4. **describe()**:
- **Usage**: Generates descriptive statistics that summarize the central tendency, dispersion, and shape of numerical columns.
- **Example**: `df.describe()`
- **Description**: Offers insights into the numerical aspects of the DataFrame, such as mean, standard deviation, and quartile information.

### Summary:

- **head()** and **tail()**: Quick methods to preview the beginning and end of your DataFrame.
- **info()**: Provides a concise summary of DataFrame details.
- **describe()**: Generates statistical insights into numerical data in the DataFrame.

These methods are essential for initial data exploration, providing a comprehensive view of the DataFrame's structure and content.
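
Unlike the other three methods, `info()` prints its summary and returns `None`. If you want the summary as text, for example to log it, one option is to pass a buffer through the `buf` parameter, as in this sketch. The sample data and the names `buffer` and `info_text` are chosen here for illustration.

```python
import io

import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Emma', 'Peter', 'Alice', 'Bob'],
    'Age': [28, 24, 32, 30, 35],
})

# info() writes to the buffer instead of the screen
buffer = io.StringIO()
result = df.info(buf=buffer)
info_text = buffer.getvalue()

print(result)     # None
print(info_text)  # column names, non-null counts, dtypes, memory usage
```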

----------------

### DataFrame Attributes

In Pandas, you can access several attributes of a DataFrame to understand its structure and characteristics:

- **shape**: Gives the dimensions (rows, columns) of the DataFrame.
- **columns**: Returns the column labels of the DataFrame.
- **index**: Provides information about the index (row labels) of the DataFrame.
- **dtypes**: Displays the data types of each column in the DataFrame.

These attributes help you quickly retrieve essential information about the DataFrame's layout, column names, index details, and data types.

### Summary:

- **shape**: Returns a tuple representing the dimensions (rows, columns) of the DataFrame.
- **columns**: Provides the names of columns in the DataFrame.
- **index**: Shows the row labels or index of the DataFrame.
- **dtypes**: Displays the data types of columns in the DataFrame.

These attributes are fundamental for understanding and manipulating data in Pandas, facilitating efficient data exploration and analysis.

In [35]:
# Creating a sample DataFrame
data = {
    'Name': ['John', 'Emma', 'Peter'],
    'Age': [28, 24, 32],
    'City': ['New York', 'San Francisco', 'Chicago']
}
df = pd.DataFrame(data)
# Print the DataFrame itself
print("\nDataFrame:")
print(df)

# Printing DataFrame attributes
print("Shape of the DataFrame:")
print(df.shape)
print("\nColumns of the DataFrame:")
print(df.columns)
print("\nIndex of the DataFrame:")
print(df.index)
print("\nData types of columns:")
print(df.dtypes)

#RUN THIS CELL TO SEE EXAMPLE


DataFrame:
    Name  Age           City
0   John   28       New York
1   Emma   24  San Francisco
2  Peter   32        Chicago
Shape of the DataFrame:
(3, 3)

Columns of the DataFrame:
Index(['Name', 'Age', 'City'], dtype='object')

Index of the DataFrame:
RangeIndex(start=0, stop=3, step=1)

Data types of columns:
Name    object
Age      int64
City    object
dtype: object


----

# *Understanding Summary Statistics in Pandas*

### Summary Statistics with `describe()`

In Pandas, you can use the `describe()` method to get a quick overview of numerical data in your DataFrame. Here's what it provides:

- **Count**: Number of non-null values in each numerical column.
- **Mean**: Average (mean) value of each numerical column.
- **Std**: Standard deviation, which measures the amount of variation or dispersion in each column.
- **Min**: The smallest value in each column.
- **25%, 50%, 75%**: Percentiles that indicate the distribution of data. For example, the 50th percentile (median) is the middle value of the dataset.
- **Max**: The largest value in each column.

In [37]:
### Example Usage:
# Creating a sample DataFrame

data = {
    'Name': ['John', 'Emma', 'Peter', 'Alice', 'Bob'],
    'Age': [28, 24, 32, 30, 35],
    'Height (inches)': [68, 65, 70, 67, 72],
    'Weight (kg)': [70, 65, 80, 75, 85]
}
df = pd.DataFrame(data)

# Generating summary statistics
summary = df.describe()

# Printing the summary statistics
print("Summary Statistics:")
print(summary)

Summary Statistics:
             Age  Height (inches)  Weight (kg)
count   5.000000         5.000000     5.000000
mean   29.800000        68.400000    75.000000
std     4.147288         2.701851     7.905694
min    24.000000        65.000000    65.000000
25%    28.000000        67.000000    70.000000
50%    30.000000        68.000000    75.000000
75%    32.000000        70.000000    80.000000
max    35.000000        72.000000    85.000000


### Explanation:

- **describe()**: Provides essential statistics for numerical columns in the DataFrame.
- **Summary**: The output of `describe()` includes count, mean, standard deviation, minimum value, percentiles (25%, 50%, 75%), and maximum value for each numerical column.

This method is useful for quickly understanding the distribution and central tendencies of your numerical data, aiding in data exploration and initial insights into your dataset.
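
`describe()` can also be adjusted. The sketch below shows two variations that go beyond the example above: requesting custom percentiles with the `percentiles` parameter, and calling `describe()` on a text column, which reports counts and the most frequent value instead of a mean. The sample data is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Emma', 'Peter', 'Alice', 'Bob'],
    'Age': [28, 24, 32, 30, 35],
})

# Ask for the 10th, 50th, and 90th percentiles instead of the defaults
custom = df.describe(percentiles=[0.1, 0.5, 0.9])

# On a text column, describe() reports count, unique, top, and freq
name_summary = df['Name'].describe()

print(custom.index.tolist())
print(name_summary['unique'])  # 5
```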

----

# **Data Selection and Filtering**

### Selecting Columns and Rows in Pandas

In Pandas, you can select specific columns and rows using different methods:

- **loc[]**: Selects data by labels (index and column names).
- **iloc[]**: Selects data by integer location (index position).
- **at[]**: Accesses a single value using labels (index and column names).
- **iat[]**: Accesses a single value using integer location (index position).

In [38]:
### Example Usage:

# Creating a sample DataFrame
data = {
    'Name': ['John', 'Emma', 'Peter', 'Alice', 'Bob'],
    'Age': [28, 24, 32, 30, 35],
    'City': ['New York', 'San Francisco', 'Chicago', 'Boston', 'Seattle']
}
df = pd.DataFrame(data)

# Using loc[] to select specific rows and columns by labels
print("Using loc[]:")
print(df.loc[1:3, ['Name', 'Age']])  # Select rows labeled 1 through 3 (inclusive) and columns 'Name' and 'Age'
print(df.loc[df['Age'] > 30])  # Select rows where Age is greater than 30

# Using iloc[] to select specific rows and columns by integer location
print("\nUsing iloc[]:")
print(df.iloc[1:3, [0, 1]])  # Select rows at positions 1 and 2 (end position 3 is excluded) and columns at positions 0 and 1

# Using at[] to access a single value by labels
print("\nUsing at[]:")
print(df.at[2, 'Name'])  # Access the value at row label 2 and column label 'Name'

# Using iat[] to access a single value by integer location
print("\nUsing iat[]:")
print(df.iat[3, 2])  # Access the value at row position 3 and column position 2


Using loc[]:
    Name  Age
1   Emma   24
2  Peter   32
3  Alice   30
    Name  Age     City
2  Peter   32  Chicago
4    Bob   35  Seattle

Using iloc[]:
    Name  Age
1   Emma   24
2  Peter   32

Using at[]:
Peter

Using iat[]:
Boston


### Explanation:

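- **loc[]**: Selects rows and columns based on their labels. You can specify rows and columns using labels directly or conditionally. Label slices include both endpoints, which is why `df.loc[1:3]` returns three rows.
- **iloc[]**: Selects rows and columns based on their integer positions (0-based index). Position slices exclude the end position, which is why `df.iloc[1:3]` returns two rows.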
- **at[]**: Accesses a single value using labels for both row and column.
- **iat[]**: Accesses a single value using integer positions for both row and column.

These methods provide flexibility in accessing and manipulating data in Pandas DataFrames based on either labels or integer positions, depending on your data analysis needs.
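
The difference between labels and positions is easiest to see when the index is not the default 0, 1, 2. The sketch below uses a made-up string index for that purpose.

```python
import pandas as pd

df = pd.DataFrame(
    {'Age': [28, 24, 32, 30]},
    index=['john', 'emma', 'peter', 'alice'],
)

# loc slices by label and includes both endpoints
label_slice = df.loc['emma':'peter']

# iloc slices by position and excludes the end position
position_slice = df.iloc[1:2]

# at and iat return a single value
by_label = df.at['alice', 'Age']
by_position = df.iat[0, 0]

print(label_slice.index.tolist())     # ['emma', 'peter']
print(position_slice.index.tolist())  # ['emma']
print(by_label, by_position)          # 30 28
```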

----

### Boolean Indexing in Pandas

Boolean indexing in Pandas allows you to select rows based on a condition expressed as a boolean (True or False). Here’s how you can use it:

### Example Usage:

In [39]:
# Creating a sample DataFrame
data = {
    'Name': ['John', 'Emma', 'Peter', 'Alice', 'Bob'],
    'Age': [28, 24, 32, 30, 35],
    'City': ['New York', 'San Francisco', 'Chicago', 'Boston', 'Seattle']
}
df = pd.DataFrame(data)

# Boolean indexing: Select rows where Age is greater than 30
print("Boolean indexing:")
print(df[df['Age'] > 30])

Boolean indexing:
    Name  Age     City
2  Peter   32  Chicago
4    Bob   35  Seattle


### Explanation:

- **Boolean Indexing**: The expression `df['Age'] > 30` creates a boolean Series where each element indicates whether the condition (Age > 30) is True or False.
- **df[df['Age'] > 30]**: This selects rows from the DataFrame where the condition is True, effectively filtering out rows where Age is not greater than 30.

Boolean indexing is useful for filtering data based on specific conditions, allowing you to focus on subsets of your data that meet certain criteria. It's a powerful feature for data manipulation and analysis in Pandas.
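
Conditions can also be combined. The sketch below, on the same sample data, uses `&` (and), `|` (or), `~` (not), and `isin()`. Each condition needs its own parentheses because `&` and `|` bind more tightly than comparisons in Python.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Emma', 'Peter', 'Alice', 'Bob'],
    'Age': [28, 24, 32, 30, 35],
    'City': ['New York', 'San Francisco', 'Chicago', 'Boston', 'Seattle'],
})

# AND: older than 30 and not living in Seattle
older_not_seattle = df[(df['Age'] > 30) & (df['City'] != 'Seattle')]

# OR: younger than 25 or living in Boston
young_or_boston = df[(df['Age'] < 25) | (df['City'] == 'Boston')]

# isin(): match any value in a list; ~ inverts the mask
in_cities = df[df['City'].isin(['Chicago', 'Seattle'])]
not_in_cities = df[~df['City'].isin(['Chicago', 'Seattle'])]

print(older_not_seattle['Name'].tolist())  # ['Peter']
print(young_or_boston['Name'].tolist())    # ['Emma', 'Alice']
print(in_cities['Name'].tolist())          # ['Peter', 'Bob']
print(not_in_cities['Name'].tolist())      # ['John', 'Emma', 'Alice']
```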

----

# Setting Values in Pandas

In Pandas, you can set values in a DataFrame in several ways, depending on whether you want to modify an entire column, the rows that match a condition, or a single cell.

### Example Usage:

In [40]:
# Creating a sample DataFrame
data = {
    'Name': ['John', 'Emma', 'Peter', 'Alice', 'Bob'],
    'Age': [28, 24, 32, 30, 35],
    'City': ['New York', 'San Francisco', 'Chicago', 'Boston', 'Seattle']
}
df = pd.DataFrame(data)

# Setting values in a DataFrame
# 1. Setting values for an entire column
df['Age'] = [30, 25, 33, 31, 36]  # Update all values in the 'Age' column

# 2. Setting values based on a condition
df.loc[df['Name'] == 'Emma', 'City'] = 'Los Angeles'  # Update 'City' for rows where 'Name' is 'Emma'

# 3. Setting a single value using index and column labels
df.at[0, 'City'] = 'Chicago'  # Set the value in the first row and 'City' column

# Printing the updated DataFrame
print("Updated DataFrame:")
print(df)

Updated DataFrame:
    Name  Age         City
0   John   30      Chicago
1   Emma   25  Los Angeles
2  Peter   33      Chicago
3  Alice   31       Boston
4    Bob   36      Seattle


### Explanation:

- **Setting Values**:
  - **Setting values for an entire column**: You can assign a list or array to a column to update all its values.
  - **Setting values based on a condition**: Use `loc[]` to set values based on a condition (e.g., updating 'City' where 'Name' is 'Emma').
  - **Setting a single value**: Use `at[]` to set a single value at a specific row and column intersection.

These methods allow you to efficiently update and modify data in Pandas DataFrames, whether for bulk updates or specific cell assignments based on conditions.
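
A few further variations, sketched under the same assumptions: adding a new column from a condition, updating matching rows in place, and setting one cell by position with `iat[]`. The column name `Senior` is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Emma', 'Peter'],
    'Age': [28, 24, 32],
    'City': ['New York', 'San Francisco', 'Chicago'],
})

# Add a new boolean column derived from an existing one
df['Senior'] = df['Age'] >= 30

# Update only the rows that match a condition
df.loc[df['Age'] < 30, 'Age'] += 1

# Set a single cell by integer position (row 2, column 2 is Peter's City)
df.iat[2, 2] = 'Boston'

print(df['Senior'].tolist())  # [False, False, True]
print(df['Age'].tolist())     # [29, 25, 32]
print(df.at[2, 'City'])       # Boston
```

Prefer a single `loc[]` call with both the row condition and the column name over chained indexing such as `df[df['Age'] < 30]['Age'] = ...`, which may not modify the original DataFrame.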