# Pandas

Pandas is a powerful Python library for data manipulation and analysis. It provides high-performance data structures and data analysis tools that make working with data efficient and easy.

Here are some of the key reasons why you might use pandas:

1. Data Cleaning and Preparation:
   * Handling missing values: Pandas offers functions to identify and handle missing data, such as filling them with specific values or removing rows or columns containing missing data.
   * Data normalization: You can standardize data to a common scale, which is often necessary for certain machine learning algorithms.
   * Data transformation: Pandas provides tools to transform data, such as converting data types, grouping data, and aggregating values.

2. Data Analysis:
   * Descriptive statistics: Calculate summary statistics like mean, median, mode, standard deviation, and correlation coefficients to gain insights into your data.
   * Data visualization: Pandas integrates well with visualization libraries like Matplotlib and Seaborn, allowing you to create informative charts and graphs to explore your data visually.
   * Data filtering and querying: Easily filter data based on specific criteria and query dataframes to extract relevant information.

3. Data Manipulation:
   * Data merging and joining: Combine data from multiple sources into a single dataset.
   * Data reshaping: Restructure dataframes, such as pivoting or stacking data, to suit your analysis needs.
   * Time series analysis: Pandas provides tools for working with time series data, including date and time handling, time-based indexing, and time series analysis techniques.

4. Integration with Other Libraries:
   * Seaborn: Create visually appealing statistical plots with ease.
   * Scikit-learn: Perform machine learning tasks, such as classification, regression, and clustering.
   * Statsmodels: Conduct statistical modeling and hypothesis testing.

5. Efficiency and Performance:
   * Optimized data structures: Pandas' data structures are designed for efficient data manipulation and analysis.
   * Vectorized operations: Perform operations on entire datasets at once, leading to improved performance.

## Series

* One-dimensional array: A Series is essentially a one-dimensional labeled array that can hold any data type (integers, floats, strings, objects, etc.).
* Labels: Each element in a Series is associated with a label, which can be any immutable object (e.g., integers, strings). These labels are often referred to as the Series' index.
* Creation:

```python
import pandas as pd

# Creating a Series from a list
s = pd.Series([1, 2, 3, 4])

# Creating a Series with custom index
s = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
```

## DataFrames

* Two-dimensional labeled data structure: A DataFrame is essentially a collection of Series, where each Series represents a column and shares a common index.
* Rows and Columns: DataFrames are organized into rows and columns, with each row representing a record and each column representing a feature.
* Creation:

```python
# Creating a DataFrame from a dictionary
data = {'col1': [1, 2, 3], 'col2': ['a', 'b', 'c']}
df = pd.DataFrame(data)

# Creating a DataFrame from a list of lists
data = [[1, 'a'], [2, 'b'], [3, 'c']]
df = pd.DataFrame(data, columns=['col1', 'col2'])
```

## Key Operations:

* Selection: Accessing specific elements or subsets of data using labels or integer-based indexing.
* Indexing: Creating new Series or DataFrames based on specific conditions or labels.
* Filtering: Selecting rows or columns based on logical conditions.
* Aggregation: Calculating summary statistics (e.g., mean, median, sum) for Series or DataFrames.
* Grouping and Aggregation: Grouping data by specific columns and applying aggregate functions to each group.
* Joining and Merging: Combining multiple DataFrames based on common columns or indexes.
* Reshaping: Transforming the structure of DataFrames (e.g., pivoting, stacking, unstacking).

## Series vs DataFrame

### Data Structures

- Series: A one-dimensional labeled array of values.
- DataFrame: A two-dimensional labeled data structure with columns of potentially different types.


| Feature | Series | DataFrame |
|---------|--------|-----------|
| Dimensionality | 1-dimensional | 2-dimensional |
| Data structure | Labeled array | Tabular spreadsheet-like structure |
| Typical use | Single column of data | Multiple columns of data |
| Index | Single index | Row index and column labels |
| Data types | Can hold data of any single type | Can hold multiple data types across columns |
| Creation | From a list, array, or dictionary | From a dictionary of Series, list of dictionaries, or 2D numpy array |
| Shape | (n,) where n is the number of elements | (n, m) where n is the number of rows and m is the number of columns |
| Selection | Single brackets `[]` | Single `[]` or double `[[]]` brackets |
| Column operations | N/A (is a single column) | Can add, remove, or modify columns |
| Vectorized operations | Applied to entire Series | Applied to entire DataFrame or specific columns |
| Use case | Representing a single feature or time series | Representing a complete dataset with multiple features |


## Functions

### Series Methods

| Category            | Method                 | Description                           | Parameters                                                                                          |
|---------------------|-----------------------|---------------------------------------|-----------------------------------------------------------------------------------------------------|
| **Creation**        | `Series()`            | Create a new Series                   | data, index, dtype                                                                                  |
| **Manipulation**    | `append()`            | Append values to the Series           | other, ignore_index                                                                                 |
|                     | `drop()`              | Drop specified labels from the Series | labels, errors                                                                                      |
|                     | `drop_duplicates()`   | Remove duplicate values               | keep, take_last                                                                                     |
|                     | `fillna()`            | Fill missing values                   | value, method, limit                                                                                |
|                     | `reindex()`           | Reindex the Series                    | index, method, fill_value, limit                                                                    |
|                     | `rename()`            | Rename the Series                     | index, axis                                                                                         |
| **Analysis**        | `apply()`             | Apply a function to the Series        | func, args, kwargs                                                                                  |
|                     | `describe()`          | Generate descriptive statistics       | percentiles, include, exclude                                                                       |
|                     | `unique()`            | Return the unique values              | -                                                                                                   |
|                     | `value_counts()`      | Return the count of unique values     | normalize, sort, ascending, bins, dropna                                                            |
| **Statistical**     | `max()`               | Return the maximum value              | -                                                                                                   |
|                     | `mean()`              | Return the mean value                 | -                                                                                                   |
|                     | `median()`            | Return the median value               | -                                                                                                   |
|                     | `min()`               | Return the minimum value              | -                                                                                                   |
|                     | `std()`               | Return the standard deviation         | -                                                                                                   |
|                     | `var()`               | Return the variance                   | -                                                                                                   |
| **Aggregation**     | `sum()`               | Return the sum                        | -                                                                                                   |
|                     | `quantile()`          | Return the quantile value             | q                                                                                                   |
|                     | `rank()`              | Rank the values                       | axis, method, na_option, ascending                                                                  |
| **Transformation**  | `astype()`            | Cast the Series to a specified dtype  | dtype                                                                                               |
|                     | `map()`               | Map values to a new Series            | arg                                                                                                 |
|                     | `interpolate()`       | Interpolate missing values            | method, axis, limit, inplace                                                                        |
|                     | `to_frame()`          | Convert the Series to a DataFrame     | name                                                                                                |
| **Input/Output**    | `to_csv()`            | Write the Series to a csv file        | path, sep, na_rep, header, index, index_label, mode                                                 |
|                     | `to_excel()`          | Write the Series to an excel file     | excel_writer, sheet_name, na_rep, header, index, index_label                                        |
|                     | `to_json()`           | Convert the Series to a json string   | path, orient, date_format, double_precision, force_ascii, date_unit, default_handler                |
|                     | `to_string()`         | Convert the Series to a string        | buf, na_rep, formatters, float_format, header, index, index_label                                   |
|                     | `to_list()`           | Convert the Series to a list          | -                                                                                                   |
|                     | `to_numpy()`          | Convert the Series to a numpy array   | dtype                                                                                               |



---

### DataFrame Methods

| Category            | Method                 | Description                           | Parameters                                                                                          |
|---------------------|-----------------------|---------------------------------------|-----------------------------------------------------------------------------------------------------|
| **Creation**        | `DataFrame()`         | Create a new DataFrame                | data, index, columns, dtype                                                                         |
| **Manipulation**    | `append()`            | Append rows to the DataFrame          | other, ignore_index, verify_integrity                                                               |
|                     | `drop()`              | Drop specified labels from the DataFrame | labels, axis, errors                                                                                |
|                     | `drop_duplicates()`   | Remove duplicate rows                 | subset, keep, take_last, inplace                                                                     |
|                     | `dropna()`            | Remove missing values                 | axis, how, thresh, subset, inplace                                                                  |
|                     | `rename()`            | Rename the DataFrame columns          | columns                                                                                              |
| **Analysis**        | `apply()`             | Apply a function to the DataFrame     | func, axis, args, kwargs                                                                            |
|                     | `describe()`          | Generate descriptive statistics       | percentiles, include, exclude                                                                       |
|                     | `duplicated()`        | Return boolean Series denoting duplicate rows | subset, keep                                                                                         |
| **Statistical**     | `count()`             | Count the number of non-NA values     | axis, level                                                                                         |
|                     | `corr()`              | Compute pairwise correlation          | method, min_periods                                                                                 |
|                     | `cov()`               | Compute pairwise covariance           | min_periods                                                                                         |
| **Cumulative**      | `cummax()`            | Return the cumulative maximum         | axis, skipna                                                                                        |
|                     | `cummin()`            | Return the cumulative minimum         | axis, skipna                                                                                        |
|                     | `cumsum()`            | Return the cumulative sum             | axis, skipna                                                                                        |
| **Transformation**  | `astype()`            | Cast the DataFrame to a specified dtype | dtype                                                                                               |
|                     | `applymap()`          | Apply a function to each element      | func                                                                                                |
|                     | `clip()`              | Clip the values                       | lower, upper                                                                                        |
|                     | `combine()`           | Combine the DataFrame with another    | other, func, axis, level                                                                            |
|                     | `diff()`              | Compute the difference                | periods, axis                                                                                       |
| **Input/Output**    | `to_csv()`            | Write the DataFrame to a csv file     | path, sep, na_rep, header, index, index_label, mode                                                 |
|                     | `to_excel()`          | Write the DataFrame to an excel file   | excel_writer, sheet_name, na_rep, header, index, index_label                                        |
|                     | `to_json()`           | Convert the DataFrame to a json string | path, orient, date_format, double_precision, force_ascii, date_unit, default_handler                |
|                     | `to_string()`         | Convert the DataFrame to a string      | buf, na_rep, formatters, float_format, header, index, index_label                                   |



In [1]:
import pandas as pd

data = [10,20,30,40,50]

series = pd.Series(data)
print(series)
print(series.sum())
print(series[series <20])

series = pd.Series(data, index=['a', 'b', 'c', 'd', 'e' ])
print(series)

data_dict = {
    'apple' :3,
    'banana' :5,
    'cherry' :7
}
series = pd.Series(data_dict)
print(series)


series = pd.Series(10, index=['a', 'b', 'c'])
print(series)
print(series['b'])
print(series[1])

0    10
1    20
2    30
3    40
4    50
dtype: int64
150
0    10
dtype: int64
a    10
b    20
c    30
d    40
e    50
dtype: int64
apple     3
banana    5
cherry    7
dtype: int64
a    10
b    10
c    10
dtype: int64
10
10


  print(series[1])


In [15]:
import numpy as np
import pandas as pd

# 1. Create DataFrame from dictionary
data = {
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 24, 35, 32]
}
print('Dictionary')
df1 = pd.DataFrame(data)  # Create DataFrame from dictionary

# 2. Create DataFrame from list of lists
data = [
    ['John', 28], 
    ['Anna', 24], 
    ['Peter', 35], 
    ['Linda', 32]
]
print('List')
df2 = pd.DataFrame(data, columns=['Name', 'Age'])  # Create DataFrame from list of lists

# 3. Create DataFrame from NumPy array
data = np.array([['John', 28], ['Anna', 24], ['Peter', 35], ['Linda', 32]])
print('Numpy Array')
df3 = pd.DataFrame(data, columns=['Name', 'Age'])  # Create DataFrame from NumPy array

# 4. Create DataFrame from Excel file
# print('Excel')
# df4 = pd.read_excel('data.xlsx')  # Create DataFrame from Excel file (uncomment and replace with your file path)

# 5. Create DataFrame from CSV file
# print('CSV')
# df5 = pd.read_csv('data.csv')  # Create DataFrame from CSV file (uncomment and replace with your file path)

# Operations on DataFrames
df = df1  # Use df1 for demonstration

# 1. Select rows and columns
print('\nSpecific Column')
print(df[['Name', 'Age']])  # Select specific columns
print('\nSpecific ROw - loc')
print(df.loc[0])  # Select first row
print('\nSpecific ROw - iloc')
print(df.iloc[0])  # Select first row by integer location

# 2. Filter rows
print('\nFilter - df[df["Age"] > 30]')
print(df[df['Age'] > 30])  # Filter rows where Age > 30

# 3. Group and aggregate
print("\nGroupyby - df[df['Age'] > 30]")
print(df.groupby('Name').sum())  # Group by Name and calculate sum of Age

# 4. Sort and index
print("\nSort - df.sort_values('Age')")
print(df.sort_values('Age'))  # Sort by Age in ascending order
print("\nSet index- df.set_index('Name')")
print(df.set_index('Name'))  # Set Name as index

# 5. Merge and join
df2 = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter'], 
    'Country': ['USA', 'UK', 'Germany']}
)
print("\nMergeing - pd.merge(df, df2, on='Name')")
print(pd.merge(df, df2, on='Name'))  # Merge on Name column

# 6. Handle missing data
df['Age'] = df['Age'].replace(28, np.nan)
print("\nFill NA - df.fillna(0)")
print(df.fillna(0))  # Fill NaN with 0

# 7. Dataframe statistics
print(df.describe())  # Summary statistics
print(df.info())  # DataFrame information

Dictionary
List
Numpy Array

Specific Column
    Name  Age
0   John   28
1   Anna   24
2  Peter   35
3  Linda   32

Specific ROw - loc
Name    John
Age       28
Name: 0, dtype: object

Specific ROw - iloc
Name    John
Age       28
Name: 0, dtype: object

Filter - df[df["Age"] > 30]
    Name  Age
2  Peter   35
3  Linda   32

Groupyby - df[df['Age'] > 30]
       Age
Name      
Anna    24
John    28
Linda   32
Peter   35

Sort - df.sort_values('Age')
    Name  Age
1   Anna   24
0   John   28
3  Linda   32
2  Peter   35

Set index- df.set_index('Name')
       Age
Name      
John    28
Anna    24
Peter   35
Linda   32

Mergeing - pd.merge(df, df2, on='Name')
    Name  Age  Country
0   John   28      USA
1   Anna   24       UK
2  Peter   35  Germany

Fill NA - df.fillna(0)
    Name   Age
0   John   0.0
1   Anna  24.0
2  Peter  35.0
3  Linda  32.0
             Age
count   3.000000
mean   30.333333
std     5.686241
min    24.000000
25%    28.000000
50%    32.000000
75%    33.500000
max    35.0

### Assignment 1: ###
Reading rows and columns from a  CSV file


```python
import pandas as pd

# 1. Basic method to read the entire CSV file
df1 = pd.read_csv('people_data.csv')
print("Basic Method:\n", df1)

# 2. Read only specific columns from the CSV
df2 = pd.read_csv('people_data.csv', usecols=['Name', 'City'])
print("\nRead Specific Columns:\n", df2)

# 3. Read CSV file without a header
df3 = pd.read_csv('people_data.csv', header=None)
print("\nRead Without Header:\n", df3)

# 4. Read CSV file with a custom delimiter (semicolon in this case)
df4 = pd.read_csv('people_data.csv', sep=';')
print("\nRead with Custom Delimiter (semicolon):\n", df4)

# 5. Read large CSV files in chunks
chunks = pd.read_csv('people_data.csv', chunksize=100)
print("\nReading in Chunks:")
for chunk in chunks:
    print(chunk)

# 6. Skip specific rows when reading the file (skipping first 2 rows here)
df5 = pd.read_csv('people_data.csv', skiprows=2)
print("\nSkip First 2 Rows:\n", df5)

# 7. Handling missing values by filling or replacing them
df6 = pd.read_csv('people_data.csv', na_values=["NaN", "None"]).fillna("Unknown")
print("\nHandling Missing Values:\n", df6)
```

In [None]:
import csv
import pandas as pd
filename = 'people_data.csv'
header = ['Name', 'Age', 'City', 'Occupation']
data = {
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 24, 35, 32],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston'],
    'Occupation': ['Engineer', 'Designer', 'Doctor', 'Lawyer']
}

with open(filename , mode='w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(header)
    for i in range(len(data['Name'])):
        writer.writerow([data['Name'][i], data['Age'][i], data['City'][i], data['Occupation'][i]])
print("CSV file created successfully!\n")


dataframe = pd.read_csv(filename, header=0)
print(dataframe)

CSV file created successfully!

    Name  Age         City Occupation
0   John   28     New York   Engineer
1   Anna   24  Los Angeles   Designer
2  Peter   35      Chicago     Doctor
3  Linda   32      Houston     Lawyer


### Assignment 2: ###

Create a CSV file to hold student data, roll no, name, class, marks.

Read this data into a dictionary

Inspect the data using head(), tail(), info(), describe()

In [None]:
import csv
import pandas as pd

filename = 'students_data.csv'
students_data = {
    'Roll No': [101, 102, 103, 104, 105, 106, 107, 108, 109, 110],
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eva', 'Frank', 'Grace', 'Hank', 'Ivy', 'Jack'],
    'Class': ['10th', '12th', '11th', '10th', '12th', '11th', '10th', '12th', '11th', '10th'],
    'Marks': [85, 90, 80, 88, 92, 80, 86, 89, 77, 91]
}
header = ['Roll No', 'Name', 'Class', 'Marks']
with open(filename , mode='w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(header)
    for i in range(len(students_data['Roll No'])):
        writer.writerow([students_data['Roll No'][i], students_data['Name'][i], students_data['Class'][i], students_data['Marks'][i]])
print("Student CSV file created successfully!")

dataframe = pd.read_csv(filename, header=0)
dataframe['Marks']=dataframe['Marks'].astype(int)
print('\nHead')
print(dataframe.head())
print('\nTail')
print(dataframe.tail())
print('\nInfo')
print(dataframe.info())
print('\nDescribe ')
print(dataframe.describe())

Student CSV file created successfully!

Head
   Roll No     Name Class  Marks
0      101    Alice  10th     85
1      102      Bob  12th     90
2      103  Charlie  11th     80
3      104    Diana  10th     88
4      105      Eva  12th     92

Tail
   Roll No   Name Class  Marks
5      106  Frank  11th     80
6      107  Grace  10th     86
7      108   Hank  12th     89
8      109    Ivy  11th     77
9      110   Jack  10th     91

Info
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Roll No  10 non-null     int64 
 1   Name     10 non-null     object
 2   Class    10 non-null     object
 3   Marks    10 non-null     int64 
dtypes: int64(2), object(2)
memory usage: 448.0+ bytes
None

Describe 
         Roll No      Marks
count   10.00000  10.000000
mean   105.50000  85.800000
std      3.02765   5.202563
min    101.00000  77.000000
25%    103.25000  81.2500

## Outliers

- Outliers are data points that significantly differ from other observations in a dataset
- They can occur due to variability in the measurement or indicate experimental errors
- Outliers may provide valuable insights or skew statistical analyses, depending on the context

### Characteristics of Outliers

- Extreme values that deviate from the pattern of the majority of the data
- Can be unusually high (upper outliers) or low (lower outliers)
- May affect measures of central tendency and dispersion

### Outlier Detection: Interquartile Range (IQR) Method

- Identifies outliers by defining boundaries based on the 25th percentile (Q1) and the 75th percentile (Q3)
- Steps to calculate:
  1. Calculate Q1 (25th percentile) and Q3 (75th percentile)
  2. Calculate IQR: IQR = Q3 - Q1
  3. Define lower bound: Q1 - 1.5 * IQR
  4. Define upper bound: Q3 + 1.5 * IQR
  5. Any data point outside these bounds is considered an outlier

### Importance of Outlier Analysis

- Helps in data cleaning and preprocessing
- Improves accuracy of statistical analyses and machine learning models
- Can reveal important anomalies or special cases in the data

### Caution

- Not all outliers are errors; some may represent genuine extreme values
- Context of the data and domain knowledge is crucial in interpreting outliers
- Removal of outliers should be done carefully and documented thoroughly

### Example
```python
# Outliers Example
import pandas as pd

# Sample data
data = {'Values': [10, 12, 12, 13, 12, 14, 10, 13, 13, 400]}

# Create DataFrame
df = pd.DataFrame(data)

# Calculate Q1, Q3, and IQR
Q1 = df['Values'].quantile(0.25)
Q3 = df['Values'].quantile(0.75)
IQR = Q3 - Q1
print('Q1 : ', Q1)
print('Q3 : ', Q3)
print('IQR : ', IQR)

# Define bounds for outliers
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
print('lower_bound : ', lower_bound)
print('upper_bound : ', upper_bound)


# Identify outliers
outliers = df[(df['Values'] < lower_bound) | (df['Values'] > upper_bound)]
print("Outliers:\n", outliers)
```

```console
Q1 :  12.0
Q3 :  13.0
IQR :  1.0
lower_bound :  10.5
upper_bound :  14.5
Outliers:
    Values
0      10
6      10
9     400
```

The rank function in pandas!
The rank function in pandas is used to assign ranks to the values in a Series or DataFrame. It can be used to rank the values in ascending or descending order.
Syntax:
df.rank(method='average', axis=0, na_option='keep', ascending=True)
Parameters:

* method: Method to use for ranking. Options are:
    * average: Average rank of group (default).
    * min: Lowest rank of group.
    * max: Highest rank of group.
    * first: ranks assigned in order of appearance.
    * dense: Like 'min', but rank always increases by 1 between groups.
* axis: Axis to rank on. (0 or 'index' for rows, 1 or 'columns' for columns).
* na_option: How to handle NaN values. Options are:
    * keep (default): Keep NaN values as is.
    * top: Assign highest rank to NaN values.
    * bottom: Assign lowest rank to NaN values.
* ascending: Whether to rank in ascending (True) or descending (False) order.

Example Use Cases:

    Ranking values in a Series:

```Python

import pandas as pd

s = pd.Series([10, 20, 30, 40, 50])
ranked_s = s.rank()
print(ranked_s)
```
Output
```console
0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
dtype: float64
```
    Ranking values in a DataFrame:

```Python

df = pd.DataFrame({'A': [10, 20, 30], 'B': [40, 50, 60]})
ranked_df = df.rank()
print(ranked_df)
```
Output
```console
   A  B
0  1.0  1.0
1  2.0  2.0
2  3.0  3.0
```
    Ranking in descending order:

```Python

df = pd.DataFrame({'A': [10, 20, 30]})
ranked_df = df.rank(ascending=False)
print(ranked_df)
```
Output
```console
   A
0  3.0
1  2.0
2  1.0
```
    Ranking with NaN values:

```Python

s = pd.Series([10, np.nan, 30])
ranked_s = s.rank()
print(ranked_s)
```
Output
```console
0    1.0
1    NaN
2    2.0
dtype: float64
```

In [None]:
import pandas as pd
data = pd.Series([30,60,10,20,20,30,40])

print("Average rank :")
print(data.rank(method="average"))

print("Min rank :")
print(data.rank(method="min"))

print("Max rank :")
print(data.rank(method="max"))

print("First rank :")
print(data.rank(method="first"))

print("Dense rank :")
print(data.rank(method="dense"))

Average rank :
0    4.5
1    7.0
2    1.0
3    2.5
4    2.5
5    4.5
6    6.0
dtype: float64
Min rank :
0    4.0
1    7.0
2    1.0
3    2.0
4    2.0
5    4.0
6    6.0
dtype: float64
Max rank :
0    5.0
1    7.0
2    1.0
3    3.0
4    3.0
5    5.0
6    6.0
dtype: float64
First rank :
0    4.0
1    7.0
2    1.0
3    2.0
4    3.0
5    5.0
6    6.0
dtype: float64
Dense rank :
0    3.0
1    5.0
2    1.0
3    2.0
4    2.0
5    3.0
6    4.0
dtype: float64


### Selection in pd.DataFrame ###

In [18]:

import pandas as pd

students_data = {
    'Roll No': [101, 102, 103, 104, 105, 106, 107, 108, 109, 110],
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eva', 'Frank', 'Grace', 'Hank', 'Ivy', 'Jack'],
    'Class': ['10th', '12th', '11th', '10th', '12th', '11th', '10th', '12th', '11th', '10th'],
    'Score': [85, 90, 80, 88, 92, 80, 86, 89, 77, 91]
}

df = pd.DataFrame(students_data)

print('\ndf')
print(df)
print("\ndf[['Name', 'Score']]")
print(df[['Name', 'Score']])
print("\ndf.iloc[0]")                             
print(df.iloc[0])                             # Access the first row by index position
print("\ndf.iloc[:3]")                            
print(df.iloc[:3])                            # Access the first 3 rows
print("\ndf.iloc[1:4, [0, 2]]")                   
print(df.iloc[1:4, [0, 2]])                   # Access rows 1 to 3 and specific columns
print("\ndf.iloc[2]")                             
print(df.iloc[2])                             # Access row at index 2 by position
print("\ndf.loc[df['Score'] > 90]")               
print(df.loc[df['Score'] > 90])               # Filter rows where Score is greater than 90
print("\ndf.loc[1:3, ['Name', 'Score']]")         
print(df.loc[1:3, ['Name', 'Score']])         # Access a subset of rows and columns using labels
print("\ndf[df['Roll No'].isin([104,105,109])]")  
print(df[df['Roll No'].isin([104,105,109])])  # Filter based on Roll No
print("\ndf.at[0,'Score']")                       
print(df.at[0,'Score'])                       # Access a specific element by label (row 0, 'Score')
print("\ndf.at[1,'Name']")                        
print(df.at[1,'Name'])                        # Access by label (row 1, 'Name')


df
   Roll No     Name Class  Score
0      101    Alice  10th     85
1      102      Bob  12th     90
2      103  Charlie  11th     80
3      104    Diana  10th     88
4      105      Eva  12th     92
5      106    Frank  11th     80
6      107    Grace  10th     86
7      108     Hank  12th     89
8      109      Ivy  11th     77
9      110     Jack  10th     91

df[['Name', 'Score']]
      Name  Score
0    Alice     85
1      Bob     90
2  Charlie     80
3    Diana     88
4      Eva     92
5    Frank     80
6    Grace     86
7     Hank     89
8      Ivy     77
9     Jack     91

df.iloc[0]
Roll No      101
Name       Alice
Class       10th
Score         85
Name: 0, dtype: object

df.iloc[:3]
   Roll No     Name Class  Score
0      101    Alice  10th     85
1      102      Bob  12th     90
2      103  Charlie  11th     80

df.iloc[1:4, [0, 2]]
   Roll No Class
1      102  12th
2      103  11th
3      104  10th

df.iloc[2]
Roll No        103
Name       Charlie
Class         11th
Score

### Assignment ###
* Test data loading and retrieval to perform selection operations

* Load dataset: display the first 5 rows to understand the data structure.
    * Use: pd.read_csv()

* Select Specific Columns: Extract the columns Invoice ID, Product Line, Total, display first 10 rows.

* Filter Data by Conditions: Select rows where the City is "Yangon' and the Payment method is 'Ewallet. Display first 5 results.

* Select Rows by Index Position: Display first 5 rows and the first 3 columns.
    * Use: iloc

* Select Rows by Label: Display rows from index 5 to 10 and columns Invoice ID, Branch, and Gender.
    * Use: loc

In [None]:
import pandas as pd

filename = 'random.csv'
df = pd.read_csv(filename, header=0)
print('#1')
print(df.head())
print('#2')
print(df.loc[0:10, ['name', 'phone', 'country']])
print('#3')
print(df.loc[(df['country'] == 'Chile') & (df['numberrange'] < 10)].head(5))
print('#4')
print(df.iloc[0:5, 0:5])
print('#4')
print(df.loc[0:5, ['name', 'phone', 'country' ]])

#1
             name           phone                          email  \
0  Erasmus Mathis  1-771-425-6636        lobortis.mauris@aol.edu   
1  Alika Delacruz  1-712-233-4154            leo@protonmail.couk   
2     Kane Conley  1-164-866-1122  ipsum.leo.elementum@yahoo.net   
3   Minerva Haley  1-211-352-0675          dui.lectus@google.edu   
4   Brennan Hines  (750) 757-3230       duis.gravida@outlook.com   

                     address postalZip         region      country  \
0   Ap #830-2768 Odio Avenue    824256    Amur Oblast        Chile   
1     863-6677 Tristique Av.    817334           Luik    Australia   
2  P.O. Box 485, 1574 Ac St.     36965  Waals-Brabant      Germany   
3           3372 Gravida Av.    922257        Limburg  New Zealand   
4    Ap #778-6041 Semper Rd.     77496          Cusco    Singapore   

                                                text  numberrange currency  
0  nec tempus mauris erat eget ipsum. Suspendisse...            9   $69.05  
1  ultrices, 

### Dataframe Manipulation ###

Here’s a table explaining the differences between the `melt()` and `pivot()` functions in Python's pandas, along with examples to clearly illustrate how they work.

| **Aspect**          | **`melt()`**                                                                                                                                     | **`pivot()`**                                                                                                                |
|---------------------|--------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------|
| **Function Purpose** | Converts a wide DataFrame into a long format, turning columns into rows.                                                                         | Converts a long DataFrame into a wide format, turning rows into columns.                                                     |
| **Use Case**         | When you want to "unpivot" a DataFrame, i.e., convert multiple columns into a single column of values and a corresponding column for variables.  | When you want to "pivot" a DataFrame by turning unique row values into new columns.                                           |
| **Key Parameters**   | - `id_vars`: Columns to keep intact (not melted).<br>- `value_vars`: Columns to unpivot.                                                        | - `index`: Columns to keep as the index.<br>- `columns`: Column whose unique values will become new columns.<br>- `values`: Columns to fill in the new DataFrame. |
| **Example Input**    | {'Name': ['John', 'Anna'], 'Math': [90, 85], 'Science': [80, 95]}     | pd.DataFrame({'Name': ['John', 'Anna'], 'Subject': ['Math', 'Science'], 'Score': [90, 80]}           |
| **Example Operation**| pd.melt(df, id_vars=['Name'], value_vars=['Math', 'Science'], var_name='Subject', value_name='Score') | df.pivot(index='Name', columns='Subject', values='Score')```              |


In [21]:
import pandas as pd

# Creating a sample dataframe
data = {'Name': ['John', 'Anna', 'Peter', 'Linda'],
        'Age': [28, 24, 35, 32],
        'City': ['New York', 'Paris', 'Berlin', 'London']}
df = pd.DataFrame(data)

# Displaying the original dataframe
print("Original DataFrame:")
print(df)

# **Selecting Data**

# Selecting a single column
print("\nSelecting a single column:")
print(df['Name'])

# Selecting multiple columns
print("\nSelecting multiple columns:")
print(df[['Name', 'Age']])

# Selecting rows by index
print("\nSelecting rows by index:")
print(df.loc[0])

# Selecting rows by condition
print("\nSelecting rows by condition:")
print(df[df['Age'] > 30])

# **Filtering Data**

# Filtering rows by condition
print("\nFiltering rows by condition:")
print(df[df['City'] == 'Berlin'])

# **Sorting Data**

# Sorting by a single column
print("\nSorting by a single column:")
print(df.sort_values('Age'))

# Sorting by multiple columns
print("\nSorting by multiple columns:")
print(df.sort_values(['Age', 'Name']))

# **Grouping Data**

# Grouping by a single column
print("\nGrouping by a single column:")
print(df.groupby('City')['Name'].count())

# Grouping by multiple columns
print("\nGrouping by multiple columns:")
print(df.groupby(['City', 'Age'])['Name'].count())

# **Merging Data**

# Creating another dataframe
data2 = {'Name': ['John', 'Anna', 'Peter', 'Linda'],
         'Country': ['USA', 'France', 'Germany', 'UK']}
df2 = pd.DataFrame(data2)

# Merging two dataframes
print("\nMerging two dataframes:")
print(pd.merge(df, df2, on='Name'))

# **Pivoting Data**

# Pivoting a dataframe
print("\nPivoting a dataframe:")
print(df.pivot_table(index='City', values='Age', aggfunc='mean'))

# **Melting Data**

# Melting a dataframe
print("\nMelting a dataframe:")
print(pd.melt(df, id_vars='Name', value_vars='Age'))

# **Handling Missing Data**

# Creating a dataframe with missing values
data3 = {'Name': ['John', 'Anna', 'Peter', None],
         'Age': [28, 24, None, 32]}
df3 = pd.DataFrame(data3)

# Filling missing values
print("\nFilling missing values:")
print(df3.fillna(0))

# Dropping missing values
print("\nDropping missing values:")
print(df3.dropna())

# **Transposing Data**

# Transposing a dataframe
print("\nTransposing a dataframe:")
print(df.T)

# **Renaming Columns**

# Renaming columns
print("\nRenaming columns:")
print(df.rename(columns={'Name': 'Full Name'}))

# **Dropping Columns**

# Dropping columns
print("\nDropping columns:")
print(df.drop('City', axis=1))

# **Adding New Columns**

# Adding new columns
print("\nAdding new columns:")
df['Country'] = ['USA', 'France', 'Germany', 'UK']
print(df)


# **Dropping Duplicate Rows**

# Dropping duplicate rows
print("\nDropping duplicate rows:")
df.drop_duplicates(inplace=True)
print(df)

# **Resetting Index**

# Resetting index
print("\nResetting index:")
print(df.reset_index(drop=True))

# **Applying Functions**

# Applying a function to a column
print("\nApplying a function to a column:")
df['Age'] = df['Age'].apply(lambda x: x*2)
print(df)

# **Sorting by Multiple Columns**

# Sorting by multiple columns
print("\nSorting by multiple columns:")
print(df.sort_values(['Age', 'Name']))

# **Finding Unique Values**

# Finding unique values in a column
print("\nFinding unique values in a column:")
print(df['City'].unique())

# **Finding Value Counts**

# Finding value counts in a column
print("\nFinding value counts in a column:")
print(df['City'].value_counts())

# **Finding Missing Values**

# Finding missing values in a column
print("\nFinding missing values in a column:")
print(df['Age'].isnull().sum())

# **Replacing Values**

# Replacing values in a column
print("\nReplacing values in a column:")
df['City'] = df['City'].replace('Berlin', 'Munich')
print(df)

# **Extracting Date Components**

# Extracting date components from a datetime column
print("\nExtracting date components from a datetime column:")
df['Date'] = pd.to_datetime('2022-01-01')
print(df['Date'].dt.day)

# **Merging Dataframes with Different Columns**

# Merging dataframes with different columns
print("\nMerging dataframes with different columns:")
data4 = {'Name': ['John', 'Anna', 'Peter', 'Linda'],
         'Country': ['USA', 'France', 'Germany', 'UK']}
df4 = pd.DataFrame(data4)
print(pd.merge(df, df4, on='Name', how='outer'))

# **Reshaping Data**

# Original dataframe
print("\nOriginal dataframe:")
print(df)
# Reshaping data from wide to long
print("\nReshaping data from wide to long:")
print(pd.melt(df, id_vars='Name', value_vars='Age'))

# Reshaping data from long to wide
print("\nReshaping data from long to wide:")
print(df.pivot_table(index='Name', values='Age', aggfunc='mean').reset_index())

Original DataFrame:
    Name  Age      City
0   John   28  New York
1   Anna   24     Paris
2  Peter   35    Berlin
3  Linda   32    London

Selecting a single column:
0     John
1     Anna
2    Peter
3    Linda
Name: Name, dtype: object

Selecting multiple columns:
    Name  Age
0   John   28
1   Anna   24
2  Peter   35
3  Linda   32

Selecting rows by index:
Name        John
Age           28
City    New York
Name: 0, dtype: object

Selecting rows by condition:
    Name  Age    City
2  Peter   35  Berlin
3  Linda   32  London

Filtering rows by condition:
    Name  Age    City
2  Peter   35  Berlin

Sorting by a single column:
    Name  Age      City
1   Anna   24     Paris
0   John   28  New York
3  Linda   32    London
2  Peter   35    Berlin

Sorting by multiple columns:
    Name  Age      City
1   Anna   24     Paris
0   John   28  New York
3  Linda   32    London
2  Peter   35    Berlin

Grouping by a single column:
City
Berlin      1
London      1
New York    1
Paris       1
Nam

### Handling Missing Values ###

In [None]:
import pandas as pd
import numpy as np

# Create a sample DataFrame with missing values
data = {'A': [1, 2, np.nan, 4], 'B': [5, np.nan, 7, 8]}
df = pd.DataFrame(data)
print(df)

# Methods to handle missing values
df = pd.DataFrame(data)
print("1. Fill missing values with a specific value:")
print(df.fillna(value=0))

df = pd.DataFrame(data)
print("\n2. Fill missing values forward or backward:")
print(df.fillna(method='ffill'))  # Forward fill
print(df.fillna(method='bfill'))  # Backward fill

df = pd.DataFrame(data)
print("\n3. Fill missing values using interpolation:")
print(df.interpolate(method='linear'))  # Linear interpolation
print(df.interpolate(method='quadratic'))  # Quadratic interpolation
print(df.interpolate(method='polynomial', order=2))  # Polynomial interpolation
print(df.interpolate(method='spline', order=2))  # Spline interpolation

df = pd.DataFrame(data)
print("\n4. Fill missing values with custom functions:")
print(df.fillna(value=df.mean()))  # Mean of the column
print(df.fillna(value=df.mode().iloc[0]))  # Most frequent value
print(df.apply(lambda x: x.fillna(x.interpolate(method='spline', order=2)), axis=0))  # Custom interpolation

df = pd.DataFrame(data)
print("\n5. Drop rows or columns containing missing values:")
print(df.dropna())  # Drop rows with missing values
print(df.dropna(how='all'))  # Drop rows where all values are missing
print(df.dropna(how='any'))  # Drop rows where any values are missing
print(df.dropna(axis=1))  # Drop columns with missing values

df = pd.DataFrame(data)
print("\n6. Fill missing values based on conditions:")
condition = df['A'].isnull()
df_filled = df.where(~condition, other=df['A'].mean())  # Replace missing values with the mean
print(df_filled)

df = pd.DataFrame(data)
print("\n7. Other methods:")
print(df.replace(to_replace=np.nan, value=0))  # Replace missing values with a specific value
print(df.interpolate(method='linear', limit=2))  # Interpolate missing values with a limit

     A    B
0  1.0  5.0
1  2.0  NaN
2  NaN  7.0
3  4.0  8.0
1. Fill missing values with a specific value:
     A    B
0  1.0  5.0
1  2.0  0.0
2  0.0  7.0
3  4.0  8.0

2. Fill missing values forward or backward:
     A    B
0  1.0  5.0
1  2.0  5.0
2  2.0  7.0
3  4.0  8.0
     A    B
0  1.0  5.0
1  2.0  7.0
2  4.0  7.0
3  4.0  8.0

3. Fill missing values using interpolation:
     A    B
0  1.0  5.0
1  2.0  6.0
2  3.0  7.0
3  4.0  8.0
     A    B
0  1.0  5.0
1  2.0  6.0
2  3.0  7.0
3  4.0  8.0
     A    B
0  1.0  5.0
1  2.0  6.0
2  3.0  7.0
3  4.0  8.0
     A    B
0  1.0  5.0
1  2.0  6.0
2  3.0  7.0
3  4.0  8.0

4. Fill missing values with custom functions:
          A         B
0  1.000000  5.000000
1  2.000000  6.666667
2  2.333333  7.000000
3  4.000000  8.000000
     A    B
0  1.0  5.0
1  2.0  5.0
2  1.0  7.0
3  4.0  8.0
     A    B
0  1.0  5.0
1  2.0  6.0
2  3.0  7.0
3  4.0  8.0

5. Drop rows or columns containing missing values:
     A    B
0  1.0  5.0
3  4.0  8.0
     A    B
0  1.0 

  print(df.fillna(method='ffill'))  # Forward fill
  print(df.fillna(method='bfill'))  # Backward fill


### Assignment ###
Cleaning Vaccination dataset. Use "country_vaccinations.csv" from Kaggle

* Load Dataset: Display the first 5 rows to get an overview of the data structure. Use: pd.read_csv()

* Add new column 'vaccination_rate' percentage of population vaccinated with at least one dose (use a constant population estimate)

* Modify 'daily_vaccinations' column by filling any missing values with mean of column.

* Delete 'source_name' column

* Sort DataFrame by 'total_vaccinations' in descending order to find countries with highest number of vaccinations.

* Sort DataFrame by country and date to organize data by country and chronological order

* Rename column 'iso_code' to 'country_code'

* Rename 'index' to start from 1 instead of 0.


* Handling Missing Data:

    * Identify columns with missing data and count the number of missing values in each column.
    
    * Use fillna() to fill missing values in 'people_fully_vaccinated' with 0, assuming no data indicates no full vaccinations.
    
    * Drop any rows where 'total_vaccinations' is missing.
    
* Replace any occurrences of "Moderna, Pfizer/BioNTech" in the vaccines column with "mRNA vaccines".

* Replace negative values in 'daily_vaccinations' (if any) with 0, assuming these are data entry errors.

In [None]:
import pandas as pd

filename = 'country_vaccinations.csv'
df = pd.read_csv(filename, header=0)
print(df.keys())
print(df.head())

df['vaccination_rate'] = (df['people_fully_vaccinated'] - df['people_fully_vaccinated']) / 200000
df['daily_vaccinations'] = df['daily_vaccinations'].fillna(df['daily_vaccinations'].mean())
df.drop('source_name', axis=1, inplace=True )
df.sort_values('total_vaccinations', ascending = False, inplace=True)
df.sort_values(['country', 'date'], inplace=True )
df.rename(columns={'iso_code':'country_code'}, inplace=True)
df.index = pd.Index(range(1, len(df) + 1))
df.isnull().sum()
df['people_fully_vaccinated'].fillna(0, inplace = True)
df.dropna(subset=['total_vaccinations'])
df['vaccines'].replace({"Moderna, Pfizer/BioNTech":"mRNA vaccines"}, inplace=True)
df['daily_vaccinations'].apply(lambda x:0 if x < 0 else x)

Index(['country', 'iso_code', 'date', 'total_vaccinations',
       'people_vaccinated', 'people_fully_vaccinated',
       'daily_vaccinations_raw', 'daily_vaccinations',
       'total_vaccinations_per_hundred', 'people_vaccinated_per_hundred',
       'people_fully_vaccinated_per_hundred', 'daily_vaccinations_per_million',
       'vaccines', 'source_name', 'source_website'],
      dtype='object')
       country iso_code        date  total_vaccinations  people_vaccinated  \
0  Afghanistan      AFG  2021-02-22                 0.0                0.0   
1  Afghanistan      AFG  2021-02-23                 NaN                NaN   
2  Afghanistan      AFG  2021-02-24                 NaN                NaN   
3  Afghanistan      AFG  2021-02-25                 NaN                NaN   
4  Afghanistan      AFG  2021-02-26                 NaN                NaN   

   people_fully_vaccinated  daily_vaccinations_raw  daily_vaccinations  \
0                      NaN                     NaN        

Unnamed: 0,daily_vaccinations
1,114971.789486
2,1367.000000
3,1367.000000
4,1367.000000
5,1367.000000
...,...
31236,18598.000000
31237,23205.000000
31238,27567.000000
31239,30698.000000


### Analyzing Retail Sales Data with Pandas ###

* Loading data and Inspection

    * Load data from CSV file into a Pandas DataFrame.
    * Display the first 10 rows of the data to get an overview.
    * Check for any missing values and handle them appropriately.

* Data Cleaning and Preparation:

    * Convert the Date column to datetime format.
    * Remove any duplicate entries if present.
    * Handle outliers in the Sales column using the IQR method by replacing them with the median sales value.

* Sales Analysis:

    * Calculate the total sales for each store and display the top 5 stores by sales.
    * Find out which product category has the highest average sales.
    * Identify the best-selling product in terms of quantity sold.

* Outlier Detection:

    * Use the IQR method to detect outliers in the Quantity column.

* Ranking Analysis:

    * Rank the products by total sales in descending order and display the top 10 products.
    * Rank the stores by their average sales amount using the rank() method with the 'min' ranking method for ties.


In [None]:
import pandas as pd

url = "https://drive.google.com/uc?id=1D5Wrc2-ufRZwuviJKn7t8tvW13IOholf"

# Task 1.1
df  = pd.read_csv(url, header=0)

#Task 1.2
print("Data samples : \n",df.head(10))
"""
      Date    Store    Product    Category   Sales  Quantity
0  29-01-2023  Store_D  Product_3  Category_2  430.24        11
1  09-10-2023  Store_B  Product_1  Category_2  212.26        18
2  09-08-2023  Store_C  Product_1  Category_1  538.42         8
3  03-05-2023  Store_B  Product_2  Category_2  670.34         9
4  08-11-2023  Store_A  Product_3  Category_3  562.97        17
"""

#Task 1.3
print("Null Values : \n",df.isnull().sum())
df.ffill(inplace=True)

# Task 2.1
df['Date'] = pd.to_datetime(df['Date'], format='%d-%m-%Y')

# Task 2.2
df.drop_duplicates(inplace=True)

# Task 2.3
Q1 = df['Sales'].quantile(0.25)
Q3 = df['Sales'].quantile(0.75)
IQR = Q3 - Q1
print('Q1 : ', Q1)
print('Q3 : ', Q3)
print('IQR : ', IQR)

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
print('lower_bound : ', lower_bound)
print('upper_bound : ', upper_bound)

# outliers = df[(df['Sales'] < lower_bound) | (df['Sales'] > upper_bound)]
df.loc[(df['Sales'] < lower_bound) | (df['Sales'] > upper_bound), 'Sales'] = df['Sales'].median()
print("Outliers Removed :\n",df.head())

# 3.1
total_sales_by_store = df.groupby('Store')['Sales'].sum()
top_5_stores = total_sales_by_store.sort_values(ascending=False).head(5)
print("Highest store sales :\n",top_5_stores)

# 3.2
avg_sales_by_category = df.groupby('Category')['Sales'].mean()
highest_avg_sales_category = avg_sales_by_category.sort_values(ascending=False).index[0]
print("Highest cat sales :\n",highest_avg_sales_category)

# 3.3
best_selling_product = df.groupby('Product')['Quantity'].sum().idxmax()
print("Best selling product :\n",best_selling_product)

# 4.1
print('For Quantity')
Q1 = df['Quantity'].quantile(0.25)
Q3 = df['Quantity'].quantile(0.75)
IQR = Q3 - Q1
print('Q1 : ', Q1)
print('Q3 : ', Q3)
print('IQR : ', IQR)

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
print('lower_bound : ', lower_bound)
print('upper_bound : ', upper_bound)

df.loc[(df['Quantity'] < lower_bound) | (df['Quantity'] > upper_bound), 'Quantity'] = df['Quantity'].median()

# 5.1
total_sales_by_product = df.groupby('Product')['Sales'].sum()
top_10_products = total_sales_by_product.sort_values(ascending=False).head(10)
print("TOp 10 Products :\n", top_10_products)

# 5.2
avg_sales_by_store = df.groupby('Store')['Sales'].mean()
avg_sales_by_store_ranked = avg_sales_by_store.rank(method='min', ascending=False)
print("Average sales by store :\n", avg_sales_by_store_ranked)

Data samples : 
          Date    Store    Product    Category   Sales  Quantity
0  29-01-2023  Store_D  Product_3  Category_2  430.24        11
1  09-10-2023  Store_B  Product_1  Category_2  212.26        18
2  09-08-2023  Store_C  Product_1  Category_1  538.42         8
3  03-05-2023  Store_B  Product_2  Category_2  670.34         9
4  08-11-2023  Store_A  Product_3  Category_3  562.97        17
5  27-05-2023  Store_E  Product_3  Category_1  251.85        12
6  09-04-2023  Store_C  Product_4  Category_3  660.08         8
7  09-02-2023  Store_D  Product_4  Category_2  388.39         7
8  17-05-2023  Store_D  Product_2  Category_1  164.68         9
9  03-10-2023  Store_E  Product_5  Category_2  424.08         1
Null Values : 
 Date        0
Store       0
Product     0
Category    0
Sales       0
Quantity    0
dtype: int64
Q1 :  229.6675
Q3 :  698.245
IQR :  468.5775
lower_bound :  -473.19875
upper_bound :  1401.11125
Outliers Removed :
         Date    Store    Product    Category   Sa

### Hands on

#### Assignment 1:

Lists of employee details: Employee name, their ID, company name, and salary are stored in different lists.

1. **Task 1**: Create the first DataFrame using the `pandas` library with the help of employee name and ID.
   
2. **Task 2**: Create the second DataFrame using the `pandas` library with the help of employee ID, company name, and salary.
   
3. **Task 3**: Generate the final DataFrame by merging the first and second DataFrames to print the employee ID, employee name, company name, and salary.
   - **Hint**: Use the merge operation in `pandas`.

In [22]:
import pandas as pd

employee_names = ['Alice', 'Bob', 'Charlie']
employee_ids = [101, 102, 103]
company_names = ['Company A', 'Company B', 'Company C']
salaries = [50000, 60000, 70000]

df1 = pd.DataFrame({'Employee Name': employee_names, 'Employee ID': employee_ids})

df2 = pd.DataFrame({'Employee ID': employee_ids, 'Company Name': company_names, 'Salary': salaries})

final_df = pd.merge(df1, df2, on='Employee ID')

print(final_df)

  Employee Name  Employee ID Company Name  Salary
0         Alice          101    Company A   50000
1           Bob          102    Company B   60000
2       Charlie          103    Company C   70000
