# Usage Of Pandas In Data Science:
Pandas is a versatile data manipulation library for Python, widely used in data science to handle and analyze structured (tabular) data. Its flexible data structures and vectorized operations make it efficient for datasets that fit in memory, which is why it is a core tool in the data science ecosystem.

To use pandas, first import the package. Here it is imported under the conventional alias pd:

In [1]:
import pandas as pd

There are two primary data structures in pandas:

1. Series

2. DataFrame

## Series:
A Series is a data structure for handling one-dimensional data. A pandas Series is similar to a column in a spreadsheet or a database table. It can hold data of any type, including integers, floats, strings, and more, and it comes with an associated index to label the data.
##### Basic Structure
Data: The core data of a Series, which can be of any type (integers, floats, strings, etc.).

Index: The labels associated with the data. By default, the index is a range of integers starting from 0, but you can set custom indices.
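
As a small illustration of these two parts, the sketch below builds a Series with a custom index and then reads the data and the index back separately. The variable names and values are made up for the example.

```python
import pandas as pd

# Hypothetical values and labels, chosen only for illustration
temps = pd.Series([21.5, 23.0, 19.8], index=['mon', 'tue', 'wed'])

values = temps.to_numpy()       # the underlying data as a NumPy array
labels = temps.index.tolist()   # the index labels as a plain Python list

print(values)
print(labels)
print(temps.dtype)
```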

### Creating a Series:
You can create a Series using various methods:

##### From a List:

In [2]:
data = [10, 20, 30, 40]
series = pd.Series(data)
print(series)

0    10
1    20
2    30
3    40
dtype: int64


##### From a Dictionary:

In [3]:
data_dict = {'a': 10, 'b': 20, 'c': 30}
series_dict = pd.Series(data_dict)
print(series_dict)

a    10
b    20
c    30
dtype: int64


##### From a Scalar Value:

In [4]:
scalar_series = pd.Series(5, index=['a', 'b', 'c'])
print(scalar_series)

a    5
b    5
c    5
dtype: int64


##### With Custom Index:

In [12]:
data = [100, 200, 300]
index = ['first', 'second', 'third']
series = pd.Series(data, index=index)
print(series)

first     100
second    200
third     300
dtype: int64


### Indexing and Accessing Data:
With the default index, labels start at 0 and increase by 1 for each element, so the label of an element matches its position. The examples below use a default-index Series named numbers, because series was reassigned to the custom-index Series above.

##### Accessing Elements:

In [5]:
numbers = pd.Series([10, 20, 30, 40])  # same values as the Series from In [2]
print(numbers[1])  # label 1, which is also position 1 with the default index

20


##### Accessing with Labels:

In [6]:
print(series_dict['b']) 

20


##### Slicing:

In [7]:
print(numbers[1:3])  

1    20
2    30
dtype: int64


##### Boolean Indexing:
You can filter values based on conditions:

In [8]:
print(numbers[numbers > 20])  

2    30
3    40
dtype: int64


## Vectorized Operations and Functions:
Pandas Series supports vectorized operations, allowing you to perform operations on all elements without needing to loop through them individually.

##### 1.Scalar Operations

In [9]:
# Multiply each element by 2
print(numbers * 2)

0    20
1    40
2    60
3    80
dtype: int64


##### 2.Applying Functions

In [10]:
squared_series = numbers.apply(lambda x: x ** 2)
print(squared_series)

0     100
1     400
2     900
3    1600
dtype: int64


Here, the lambda is a small anonymous function defined with the lambda keyword. Lambda functions are useful for short, throwaway operations where defining a full function with def would be unnecessary.
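
For simple arithmetic like squaring, the same result can usually be obtained with a vectorized expression instead of apply(). The sketch below compares the two approaches on a small Series; the variable names are illustrative only.

```python
import pandas as pd

numbers = pd.Series([10, 20, 30, 40])

squared_apply = numbers.apply(lambda x: x ** 2)  # calls the lambda once per element
squared_vectorized = numbers ** 2                # a single vectorized operation

# Both approaches produce the same values
print((squared_apply == squared_vectorized).all())
```

On large Series the vectorized form is generally faster, so apply() is best kept for logic that has no vectorized equivalent.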

### Modifying Elements:
You can modify elements using index-based assignment.

In [13]:
series['second'] = 250
print(series)

first     100
second    250
third     300
dtype: int64
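
When a Series has a custom index, it can help to be explicit about whether an access is by label or by position. The following sketch uses .loc for labels and .iloc for positions; the Series name and values are hypothetical.

```python
import pandas as pd

prices = pd.Series([100, 200, 300], index=['first', 'second', 'third'])

by_label = prices.loc['second']    # label-based lookup
by_position = prices.iloc[1]       # position-based lookup (0-based)

prices.loc['third'] = 350          # assignment by label
prices.iloc[0] = 110               # assignment by position
print(prices)
```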


### Basic Methods:

##### head():
head(n): Returns the first n elements.

In [14]:
print(series.head(2))  

first     100
second    250
dtype: int64


##### tail():
tail(n): Returns the last n elements.

In [15]:
print(series.tail(2))  

second    250
third     300
dtype: int64


##### describe():
describe(): Provides a summary of statistics.

In [16]:
print(series.describe()) 

count      3.000000
mean     216.666667
std      104.083300
min      100.000000
25%      175.000000
50%      250.000000
75%      275.000000
max      300.000000
dtype: float64


##### value_counts():
value_counts(): Returns a Series containing counts of unique values.

In [17]:
print(series.value_counts())  

100    1
250    1
300    1
dtype: int64


## Handling Missing Data:
##### Checking for Missing Data:
Use isna() or isnull() to check for missing values.

In [18]:
series_with_nan = pd.Series([1, 2, None, 4])
print(series_with_nan.isna())

0    False
1    False
2     True
3    False
dtype: bool


##### Filling Missing Data:
Use fillna() to fill missing values.

In [19]:
filled_series = series_with_nan.fillna(0)
print(filled_series)  

0    1.0
1    2.0
2    0.0
3    4.0
dtype: float64


##### Dropping Missing Data: 
Use dropna() to remove missing values.

In [20]:
cleaned_series = series_with_nan.dropna()
print(cleaned_series)  

0    1.0
1    2.0
3    4.0
dtype: float64
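
Two common variations on the steps above are counting the missing values and filling them with a statistic rather than a constant. This is a minimal sketch with made-up data; whether the mean is a sensible fill value depends on the dataset.

```python
import pandas as pd

readings = pd.Series([1, 2, None, 4])

missing_count = readings.isna().sum()            # number of missing entries
filled_mean = readings.fillna(readings.mean())   # mean() skips NaN by default

print(missing_count)
print(filled_mean)
```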


## DataFrame:

A pandas DataFrame is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns). It is the central data structure of the pandas library and the one on which most data manipulation, analysis, and modeling workflows are built.

#### Key Features of a pandas DataFrame:
Data: Can be various types, including integers, floats, strings, and more complex types like lists and dictionaries.

Columns: Labels for the data in each column.

Index: Labels for the rows, which can be integers or other types, and can be customized.
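
These three parts can be inspected directly on any DataFrame. The sketch below uses a small made-up DataFrame to show the shape, column labels, row labels, and per-column data types.

```python
import pandas as pd

# Hypothetical data, used only to illustrate the attributes
people = pd.DataFrame(
    {'Name': ['Alice', 'Bob'], 'Age': [25, 30]},
    index=['a', 'b'],
)

print(people.shape)             # (number of rows, number of columns)
print(people.columns.tolist())  # column labels
print(people.index.tolist())    # row labels
print(people.dtypes)            # data type of each column
```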

#### Creating a DataFrame:
You can create a DataFrame from various data sources like lists, dictionaries, NumPy arrays, or even another DataFrame.

##### From a Dictionary of Lists:

In [21]:
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}

In [22]:
df = pd.DataFrame(data)
print(df)

      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago


##### From a List of Dictionaries:

In [27]:
data = [
    {'Name': 'Alice', 'Age': 25, 'City': 'New York'},
    {'Name': 'Bob', 'Age': 30, 'City': 'Los Angeles'},
    {'Name': 'Charlie', 'Age': 35, 'City': 'Chicago'}
]

In [24]:
df = pd.DataFrame(data)
print(df)

      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago


##### From a NumPy Array:

In [25]:
import numpy as np
data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
df = pd.DataFrame(data, columns=['A', 'B', 'C'])
print(df)

   A  B  C
0  1  2  3
1  4  5  6
2  7  8  9


#### Writing data to a CSV file:

In [65]:
# Create a sample DataFrame
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
})
# Save the DataFrame to a CSV file
df.to_csv('sample_data.csv', index=False)


##### Loading data from a CSV file:

In [66]:
df = pd.read_csv('sample_data.csv')
print(df)

      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
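
read_csv() also accepts options that control which columns are loaded and how they are parsed. The sketch below reads from an in-memory string instead of a file so that it is self-contained; the column names and values are invented for the example.

```python
import io
import pandas as pd

# Hypothetical CSV content held in a string instead of a file on disk
csv_text = "Name,Joined,Score\nAlice,2024-01-05,8.5\nBob,2024-02-10,7.0\n"

loaded = pd.read_csv(
    io.StringIO(csv_text),
    usecols=['Name', 'Joined'],   # load only these columns
    parse_dates=['Joined'],       # parse this column as datetime
)
print(loaded.dtypes)
```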


#### Basic DataFrame Operations:

##### Accessing Data:
##### By Column:

In [36]:
data = [
    {'Name': 'Alice', 'Age': 25, 'City': 'New York'},
    {'Name': 'Bob', 'Age': 30, 'City': 'Los Angeles'},
    {'Name': 'Charlie', 'Age': 35, 'City': 'Chicago'}
]
df = pd.DataFrame(data,index=['a','b','c'])

In [37]:
print(df['Name'])  

a      Alice
b        Bob
c    Charlie
Name: Name, dtype: object


##### By Row (using loc and iloc):

loc: Access by labels

In [38]:
print(df.loc['b'])  

Name            Bob
Age              30
City    Los Angeles
Name: b, dtype: object


iloc: Access by integer-location

In [39]:
print(df.iloc[0])  

Name       Alice
Age           25
City    New York
Name: a, dtype: object


Multiple Rows and Columns:

In [40]:
# Access 'Name' and 'City' columns for all rows
print(df.loc[:, ['Name', 'City']])  

      Name         City
a    Alice     New York
b      Bob  Los Angeles
c  Charlie      Chicago
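
loc and iloc can also select rows and columns at the same time. As a sketch, the example below rebuilds a DataFrame like the one above under the hypothetical name df_demo, filters rows with a boolean condition through loc, and takes a positional block with iloc.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {
        'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35],
        'City': ['New York', 'Los Angeles', 'Chicago'],
    },
    index=['a', 'b', 'c'],
)

# Rows where Age > 26, restricted to two columns (label-based)
older = df_demo.loc[df_demo['Age'] > 26, ['Name', 'City']]

# First two rows and first two columns (position-based, end-exclusive)
first_two = df_demo.iloc[0:2, 0:2]

print(older)
print(first_two)
```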


### Adding/Modifying Columns:

##### Adding a New Column:

In [41]:
df['Country'] = ['USA', 'USA', 'USA']
print(df)

      Name  Age         City Country
a    Alice   25     New York     USA
b      Bob   30  Los Angeles     USA
c  Charlie   35      Chicago     USA


##### Modifying an Existing Column:

In [42]:
df['Age'] = df['Age'] + 1
print(df)

      Name  Age         City Country
a    Alice   26     New York     USA
b      Bob   31  Los Angeles     USA
c  Charlie   36      Chicago     USA


### Deleting Rows and Columns:
##### Deleting a Column:

In [43]:
df = df.drop('Country', axis=1)
print(df)

      Name  Age         City
a    Alice   26     New York
b      Bob   31  Los Angeles
c  Charlie   36      Chicago


##### Deleting a Row:

In [45]:
df = df.drop('b', axis=0)  # Drop the row labeled 'b'
print(df)

      Name  Age      City
a    Alice   26  New York
c  Charlie   36   Chicago
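
drop() returns a new DataFrame and leaves the original unchanged unless the result is assigned back, as done above. The sketch below shows the equivalent columns= and index= keywords, which some readers find clearer than axis; the DataFrame name is hypothetical.

```python
import pandas as pd

people = pd.DataFrame(
    {'Name': ['Alice', 'Bob'], 'Age': [26, 31], 'City': ['New York', 'Los Angeles']},
    index=['a', 'b'],
)

without_city = people.drop(columns=['City'])  # same as drop('City', axis=1)
without_b = people.drop(index=['b'])          # same as drop('b', axis=0)

# The original DataFrame still has all rows and columns
print(people.shape, without_city.shape, without_b.shape)
```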


#### DataFrame Methods:

##### describe():
describe(): Provides a summary of statistics.

In [46]:
print(df.describe())

             Age
count   2.000000
mean   31.000000
std     7.071068
min    26.000000
25%    28.500000
50%    31.000000
75%    33.500000
max    36.000000


## Aggregate functions:

##### sum():

In [47]:
# Sum of the 'Age' column
print(df['Age'].sum())   

62


##### mean():

In [None]:
# Mean of the 'Age' column
print(df['Age'].mean()) 

##### min():

In [49]:
# Minimum of the 'Age' column
print(df['Age'].min()) 

26


##### max():

In [50]:
# Maximum of the 'Age' column
print(df['Age'].max()) 

36
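
Several aggregates can also be computed in one call with agg(). This is a small sketch using the same two ages as above; the DataFrame name is made up.

```python
import pandas as pd

ages = pd.DataFrame({'Age': [26, 36]}, index=['a', 'c'])

# One call returning a Series indexed by the aggregate names
summary = ages['Age'].agg(['sum', 'mean', 'min', 'max'])
print(summary)
```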


### Handling Missing Data:

##### Checking for Missing Data:

In [51]:
# Check for missing values
print(df.isna())  

    Name    Age   City
a  False  False  False
c  False  False  False


##### Filling Missing Data:

In [55]:
df_with_nan = df.copy()
df_with_nan.loc['c', 'City'] = None  # Introduce a missing value in an existing row
df_filled = df_with_nan.fillna('Unknown')
print(df_filled)

      Name  Age      City
a    Alice   26  New York
c  Charlie   36   Unknown


##### Dropping Missing Data:

In [56]:
df_dropped = df_with_nan.dropna()
print(df_dropped)

    Name  Age      City
a  Alice   26  New York


### DataFrame Operations:

##### Arithmetic Operations:
You can perform element-wise arithmetic operations on DataFrames.

In [57]:
df['Age'] = df['Age'] * 2
print(df)

      Name  Age      City
a    Alice   52  New York
c  Charlie   72   Chicago


##### Applying Functions: 
Use apply() to apply functions across rows or columns.

In [58]:
df['Age Group'] = df['Age'].apply(lambda x: 'Senior' if x > 60 else 'Adult')
print(df)

      Name  Age      City Age Group
a    Alice   52  New York     Adult
c  Charlie   72   Chicago    Senior


### Merging, Joining, and Concatenating DataFrames:

##### Concatenation:
You can concatenate DataFrames along rows or columns.

In [60]:
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
df_concat = pd.concat([df1, df2])
print(df_concat)

   A  B
0  1  3
1  2  4
0  5  7
1  6  8
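
Note that the row labels 0 and 1 are repeated in the result above. The sketch below shows two common variations: ignore_index=True to renumber the rows, and axis=1 to place the DataFrames side by side. The _2 suffix is added here only to keep the column names distinct.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})

# Stack rows and renumber the index 0..3
stacked = pd.concat([df1, df2], ignore_index=True)

# Place columns side by side, aligning on the row index
side_by_side = pd.concat([df1, df2.add_suffix('_2')], axis=1)

print(stacked)
print(side_by_side)
```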


##### Merging: 
Merge DataFrames using a key or index.

In [61]:
df_left = pd.DataFrame({'key': ['A', 'B'], 'left_val': [1, 2]})
df_right = pd.DataFrame({'key': ['B', 'C'], 'right_val': [3, 4]})
df_merged = pd.merge(df_left, df_right, on='key', how='inner')
print(df_merged)

  key  left_val  right_val
0   B         2          3


##### Joining: 
Join DataFrames on their index.

In [62]:
df_left = pd.DataFrame({'A': [1, 2]}, index=['a', 'b'])
df_right = pd.DataFrame({'B': [3, 4]}, index=['a', 'c'])
df_joined = df_left.join(df_right, how='inner')
print(df_joined)

   A  B
a  1  3


#### Grouping Data:
You can group data based on column values and apply aggregate functions.

In [63]:
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice'],
    'Age': [25, 30, 35, 25],
    'City': ['New York', 'Los Angeles', 'Chicago', 'New York']
})

grouped = df.groupby('City')
print(grouped['Age'].mean())  # Mean age per city


City
Chicago        35.0
Los Angeles    30.0
New York       25.0
Name: Age, dtype: float64
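
A grouped object can also produce several named results at once. The sketch below uses named aggregation on the same data; the output column names mean_age and people are chosen for this example.

```python
import pandas as pd

df_people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice'],
    'Age': [25, 30, 35, 25],
    'City': ['New York', 'Los Angeles', 'Chicago', 'New York']
})

# One row per city, with the mean age and the number of rows in each group
by_city = df_people.groupby('City').agg(
    mean_age=('Age', 'mean'),
    people=('Name', 'count'),
)
print(by_city)
```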


### Applications of pandas in data science:
Pandas is a crucial library in data science for managing and analyzing data. Here are some key applications:
##### Data Cleaning and Preprocessing:
Handling missing data, removing duplicates, and transforming data types.
##### Exploratory Data Analysis (EDA):
Summarizing data with descriptive statistics, visualizing trends, and identifying correlations.
##### Data Filtering and Selection:
Querying and slicing data to focus on specific subsets.
##### Data Aggregation and Grouping:
Grouping data for summary statistics and creating pivot tables for in-depth analysis.
##### Time Series Analysis:
Analyzing and forecasting trends in time-stamped data.
##### Data Integration:
Merging, joining, and concatenating multiple datasets for comprehensive analysis.
##### Feature Engineering:
Creating and transforming features for machine learning models.
##### Data Export and Reporting:
Exporting cleaned and analyzed data to various formats for reporting and further use.

Pandas simplifies data handling, making it essential for data preparation, analysis, and modeling in data science.
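
A few of these applications (removing duplicates, transforming data types, and building a pivot table) can be sketched on a small made-up dataset. All names and values below are hypothetical.

```python
import pandas as pd

raw = pd.DataFrame({
    'Product': ['A', 'A', 'B', 'B'],
    'Month': ['Jan', 'Jan', 'Jan', 'Feb'],
    'Sales': ['100', '100', '80', '90'],   # numbers stored as text
})

# Remove the duplicated row and convert Sales from text to integers
clean = raw.drop_duplicates().astype({'Sales': 'int64'})

# One row per product, one column per month; missing combinations become NaN
pivot = clean.pivot_table(index='Product', columns='Month', values='Sales', aggfunc='sum')

print(clean)
print(pivot)
```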

#### Analyzing data with a real-world example:

Problem Statement:

A retail store wants to analyze its sales data to understand which products are performing well and which are not. The goal is to make informed decisions about inventory management and promotional strategies.

Steps to follow:
##### Data Collection and Import:

The store collects sales data in a CSV file that includes columns for the sale date, product name, quantity sold, and sales amount. Here, a small sample dataset is created and saved to a CSV file to stand in for that data.


Import the data into a pandas DataFrame:

In [84]:
import pandas as pd

In [85]:
data = {
    'Date': ['2024-01-01', '2024-01-01', '2024-01-02', '2024-01-02', '2024-01-03', '2024-01-03', '2024-02-01', '2024-02-01', '2024-02-02',
             '2024-02-02', '2024-02-03', '2024-02-03', '2024-03-01', '2024-03-01', '2024-03-02', '2024-03-02', '2024-03-03', '2024-03-03'],
    'Product': ['Product A', 'Product B', 'Product A', 'Product C','Product B', 'Product C', 'Product A', 'Product B', 'Product A', 
                'Product C', 'Product B', 'Product C', 'Product A', 'Product B', 'Product A', 'Product C', 'Product B', 'Product C'],
    'Quantity': [10, 5, 7, 3, 8, 6, 12, 4, 9, 5, 7, 8, 15, 6, 10, 7, 5, 6],
    'SalesAmount': [200, 100, 140, 90, 160, 180, 240, 80, 180, 150, 140, 240, 300, 120, 200, 210, 100, 180]
    }

In [86]:
df = pd.DataFrame(data)

In [87]:
df.to_csv('sales_data.csv', index=False)

In [88]:
df = pd.read_csv('sales_data.csv')

##### Data Cleaning and Preprocessing:

Convert the Date column to a datetime format for easier analysis

In [89]:
df['Date'] = pd.to_datetime(df['Date'])

Handle missing values, if any. For simplicity, let's fill missing Quantity values with 0.

In [90]:
# Assign the result back; calling fillna(inplace=True) on a selected column is deprecated
df['Quantity'] = df['Quantity'].fillna(0)

##### Exploratory Data Analysis (EDA):

Calculate total sales for each product

In [91]:
product_sales = df.groupby('Product')['SalesAmount'].sum().sort_values(ascending=False)
print(product_sales)

Product
Product A    1260
Product C    1050
Product B     700
Name: SalesAmount, dtype: int64


##### Sales Performance Analysis:

Calculate average sales per product

In [94]:
avg_sales_per_product = df.groupby('Product')['SalesAmount'].mean()
print(avg_sales_per_product)

Product
Product A    210.000000
Product B    116.666667
Product C    175.000000
Name: SalesAmount, dtype: float64


##### Reporting:
Generate a summary report of sales performance and trends

In [98]:
sales_summary = df.groupby('Product').agg({
    'SalesAmount': ['sum', 'mean'],
    'Quantity': 'sum'
})
sales_summary.to_csv('sales_summary_report.csv')
print(sales_summary)


          SalesAmount             Quantity
                  sum        mean      sum
Product                                   
Product A        1260  210.000000       63
Product B         700  116.666667       35
Product C        1050  175.000000       35
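
The summary above covers per-product performance. To look at trends over time, one possible approach is to group the sales by month. The sketch below rebuilds only the Date and SalesAmount columns of the sample data so that it runs on its own; the names sales and monthly_sales are chosen for this example.

```python
import pandas as pd

sales = pd.DataFrame({
    'Date': pd.to_datetime([
        '2024-01-01', '2024-01-01', '2024-01-02', '2024-01-02', '2024-01-03', '2024-01-03',
        '2024-02-01', '2024-02-01', '2024-02-02', '2024-02-02', '2024-02-03', '2024-02-03',
        '2024-03-01', '2024-03-01', '2024-03-02', '2024-03-02', '2024-03-03', '2024-03-03',
    ]),
    'SalesAmount': [200, 100, 140, 90, 160, 180, 240, 80, 180,
                    150, 140, 240, 300, 120, 200, 210, 100, 180],
})

# Convert each date to its calendar month, then total the sales per month
monthly_sales = sales.groupby(sales['Date'].dt.to_period('M'))['SalesAmount'].sum()
print(monthly_sales)
```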


## Advantages Of Pandas in Data Science:

##### Ease of Use:
Intuitive syntax and user-friendly data structures (`DataFrame` and `Series`) simplify data manipulation.
##### Data Cleaning:
Efficient handling of missing data and powerful data transformation tools.
##### Data Analysis:
Built-in functions for statistics, grouping, and aggregation make data analysis straightforward.
##### Integration:
Seamless compatibility with libraries like NumPy, SciPy, and visualization tools like Matplotlib.
##### Time Series Analysis:
Robust support for time-based data, including resampling and rolling statistics.
##### File Handling:
Versatile support for reading and writing various file formats (CSV, Excel, JSON, SQL).
##### Performance:
Optimized for fast operations and efficient memory usage with vectorized operations.
##### Flexibility:
Customizable operations with functions like `apply()`, and flexible indexing.
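
The resampling and rolling statistics mentioned under Time Series Analysis can be sketched as follows. The daily values are made up, and the window sizes are arbitrary choices for the example.

```python
import pandas as pd

# Six consecutive days of hypothetical measurements
dates = pd.date_range('2024-01-01', periods=6, freq='D')
daily = pd.Series([1, 2, 3, 4, 5, 6], index=dates)

two_day_totals = daily.resample('2D').sum()    # totals over 2-day bins
rolling_mean = daily.rolling(window=3).mean()  # mean over the current and two previous days

print(two_day_totals)
print(rolling_mean)
```

The first two entries of the rolling mean are NaN because a full 3-day window is not yet available.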
