# **PANDAS LIBRARY IN PYTHON**
# What is Pandas?
Pandas is a Python library used for working with data sets.

It has functions for analyzing, cleaning, exploring, and manipulating data.

The name "Pandas" refers to both "Panel Data" and "Python Data Analysis". The library was created by Wes McKinney in 2008.
# Why Use Pandas?
Pandas allows us to analyze large data sets and draw conclusions based on statistical theories.

Pandas can clean messy data sets, and make them readable and relevant.

Relevant data is very important in data science.
# What Can Pandas Do?
Pandas gives you answers about the data, such as:

* Is there a correlation between two or more columns?
* What is the average value?
* What is the maximum value?
* What is the minimum value?
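
The questions above can be sketched in a few lines. This is a minimal illustration that borrows the small calories/duration table used later in this notebook; the variable names are chosen here for readability and are not part of Pandas itself.

```python
import pandas as pd

# Small sample table (same values as the DataFrame example later in the notebook)
df = pd.DataFrame({
    "calories": [420, 380, 390],
    "duration": [50, 40, 45],
})

# Average, maximum, and minimum of one column
average_calories = df["calories"].mean()
max_calories = df["calories"].max()
min_calories = df["calories"].min()

# Correlation between two columns (Pearson by default)
calories_duration_corr = df["calories"].corr(df["duration"])

print(average_calories, max_calories, min_calories)
print(calories_duration_corr)
```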

# Pandas Series
**What is a Series?**
* A Pandas Series is like a column in a table.

* It is a one-dimensional array holding data of any type.

In [None]:
import pandas as pd

a = [1, 7, 2]

myvar = pd.Series(a)

print(myvar)

0    1
1    7
2    2
dtype: int64


# What is a DataFrame?
A Pandas DataFrame is a two-dimensional data structure, like a two-dimensional array or a table with rows and columns.


In [None]:
#data frame in pandas
import pandas as pd

data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}

#load data into a DataFrame object:
df = pd.DataFrame(data)

print(df)

   calories  duration
0       420        50
1       380        40
2       390        45


In [None]:
#Named Indexes
import pandas as pd

data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}

df = pd.DataFrame(data, index = ["day1", "day2", "day3"])

print(df)

      calories  duration
day1       420        50
day2       380        40
day3       390        45


# Read CSV Files
A simple way to store big data sets is to use CSV files (comma-separated values files).

CSV files contain plain text and are a well-known format that can be read by most tools, including Pandas.

In our examples we will be using a CSV file called 'environmental-protection-expenditure-account-2009-2022.csv'.

In [None]:
import pandas as pd

df = pd.read_csv('/content/environmental-protection-expenditure-account-2009-2022.csv')
print(df.to_string())
print(pd.options.display.max_rows)

     year              sector                                   class                              cfn_tle1                       cfn_tle2       units magnitude                  source  data_value flag
0    2009  Central government                                   Total  Environmental protection expenditure  Final consumption expenditure  Proportion    Actual  Environmental Accounts         1.2    F
1    2010  Central government                                   Total  Environmental protection expenditure  Final consumption expenditure  Proportion    Actual  Environmental Accounts         1.2    F
2    2011  Central government                                   Total  Environmental protection expenditure  Final consumption expenditure  Proportion    Actual  Environmental Accounts         1.1    F
3    2012  Central government                                   Total  Environmental protection expenditure  Final consumption expenditure  Proportion    Actual  Environmental Accounts        

In [None]:
print('Name,City,Age,Gender,Occupation')
print('Alice,New York,25,Female,Engineer')
print('Bob,London,30,Male,Doctor')
print('Charlie,Paris,22,Male,Artist')
print('David,Tokyo,28,Male,Teacher')

Name,City,Age,Gender,Occupation
Alice,New York,25,Female,Engineer
Bob,London,30,Male,Doctor
Charlie,Paris,22,Male,Artist
David,Tokyo,28,Male,Teacher


**PANDAS DATAFRAME**
 * While NumPy is great for numerical operations, Pandas is better suited to selecting, filtering, and modifying labeled, tabular data. The cell below performs these operations on a small DataFrame:

In [None]:
import pandas as pd

# Sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, 30, 22, 28],
        'City': ['New York', 'London', 'Paris', 'Tokyo']}
df = pd.DataFrame(data)

# Selecting data
names = df['Name']  # Select the 'Name' column
alice_data = df[df['Name'] == 'Alice']  # Select row where Name is 'Alice'

# Filtering rows
age_above_25 = df[df['Age'] > 25]  # Select rows where Age is greater than 25

# Modifying data
df['Age'] = df['Age'] + 1  # Increase everyone's age by 1
df.loc[df['Name'] == 'Alice', 'City'] = 'Seattle'  # Change Alice's city to Seattle

# Print the modified DataFrame
print(df)

      Name  Age     City
0    Alice   26  Seattle
1      Bob   31   London
2  Charlie   23    Paris
3    David   29    Tokyo



**Viewing the Data**
* One of the most used methods for getting a quick overview of a DataFrame is the head() method.

* The head() method returns the headers and a specified number of rows, starting from the top (5 rows if no number is given).



In [None]:
import pandas as pd

df = pd.read_csv('/content/environmental-protection-expenditure-account-2009-2022.csv')

print(df.head(10))

   year              sector  class                              cfn_tle1  \
0  2009  Central government  Total  Environmental protection expenditure   
1  2010  Central government  Total  Environmental protection expenditure   
2  2011  Central government  Total  Environmental protection expenditure   
3  2012  Central government  Total  Environmental protection expenditure   
4  2013  Central government  Total  Environmental protection expenditure   
5  2014  Central government  Total  Environmental protection expenditure   
6  2015  Central government  Total  Environmental protection expenditure   
7  2016  Central government  Total  Environmental protection expenditure   
8  2017  Central government  Total  Environmental protection expenditure   
9  2018  Central government  Total  Environmental protection expenditure   

                        cfn_tle2       units magnitude  \
0  Final consumption expenditure  Proportion    Actual   
1  Final consumption expenditure  Proportion   

# PANDAS CLEANING DATA

**Data Cleaning**
* Data cleaning means fixing bad data in your data set.

Bad data could be:

* Empty cells
* Data in wrong format
* Wrong data
* Duplicates

**Remove Rows**
* One way to deal with empty cells is to remove rows that contain empty cells.

* This is usually OK, since data sets can be very big, and removing a few rows will not have a big impact on the result.



In [None]:
import pandas as pd

df = pd.read_csv('/content/environmental-protection-expenditure-account-2009-2022.csv')

new_df = df.dropna()

print(new_df.to_string())

     year              sector                                   class                              cfn_tle1                       cfn_tle2       units magnitude                  source  data_value flag
0    2009  Central government                                   Total  Environmental protection expenditure  Final consumption expenditure  Proportion    Actual  Environmental Accounts         1.2    F
1    2010  Central government                                   Total  Environmental protection expenditure  Final consumption expenditure  Proportion    Actual  Environmental Accounts         1.2    F
2    2011  Central government                                   Total  Environmental protection expenditure  Final consumption expenditure  Proportion    Actual  Environmental Accounts         1.1    F
3    2012  Central government                                   Total  Environmental protection expenditure  Final consumption expenditure  Proportion    Actual  Environmental Accounts        

**Data of Wrong Format**
* Cells with data in the wrong format can make it difficult, or even impossible, to analyze data.

* To fix this, you have two options: remove the rows, or convert all cells in the column into the same format.



**Convert Into a Correct Format**
* In our Data Frame, we have two cells with the wrong format. Check out rows 22 and 26: the 'Date' column should be a string that represents a date.
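
Both options can be sketched on a small, made-up table. The column names and values below are assumptions for illustration only, not the rows of the data set referred to above. The sketch converts the column with pd.to_datetime(), using errors="coerce" so that unparseable cells become NaT (missing) instead of raising an error, and then removes the rows that could not be converted.

```python
import pandas as pd

# Hypothetical data: two valid date strings, one invalid string, one empty cell
df = pd.DataFrame({
    "Duration": [60, 45, 30, 50],
    "Date": ["2020/12/01", "2020/12/02", "not a date", None],
})

# Option 1: convert the whole column to one format.
# Cells that cannot be parsed become NaT rather than raising an error.
df["Date"] = pd.to_datetime(df["Date"], errors="coerce")

# Option 2: remove the rows whose date could not be converted
clean_df = df.dropna(subset=["Date"])

print(clean_df)
```

Coercing first and dropping afterwards keeps the two decisions separate: you can inspect which rows failed to convert before deciding whether to remove them.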



# Pandas Fixing Wrong Data
**WRONG DATA**
* "Wrong data" does not have to be "empty cells" or "wrong format"; it can just be wrong, for example if someone registered "199" instead of "1.99".

* Sometimes you can spot wrong data by looking at the data set, because you have an expectation of what it should be.

* If you take a look at our data set, you can see that in row 7, the duration is 450, but for all the other rows the duration is between 30 and 60.

* It doesn't have to be wrong, but considering that this is the data set of someone's workout sessions, we can conclude that this person did not work out for 450 minutes.
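
A minimal sketch of two common ways to handle such a value follows. The table and the 120-minute threshold are assumptions made up for this example; the right threshold depends on what you know about the data.

```python
import pandas as pd

# Hypothetical workout data with one suspicious duration (450)
df = pd.DataFrame({
    "Duration": [60, 45, 450, 30],
    "Calories": [409.1, 282.4, 253.3, 195.1],
})

MAX_DURATION = 120  # assumed upper bound for a plausible session, in minutes

# Option 1: replace values above the threshold with the threshold itself
fixed = df.copy()
fixed.loc[fixed["Duration"] > MAX_DURATION, "Duration"] = MAX_DURATION

# Option 2: remove the rows that contain implausible values
dropped = df[df["Duration"] <= MAX_DURATION]

print(fixed)
print(dropped)
```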




**Pandas Data Correlation**
  * A great aspect of the Pandas module is the corr() method.

  * The corr() method calculates the pairwise correlation between the numeric columns in your data set; each result is a value between -1 and 1.
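
As a rough illustration, corr() can be called on a small all-numeric DataFrame. The column names and values here are invented for the example; Calories is deliberately set to exactly 8 times Duration so that their correlation comes out as a perfect 1.

```python
import pandas as pd

# Hypothetical workout data; Calories is exactly 8 * Duration
df = pd.DataFrame({
    "Duration": [30, 45, 60, 90],
    "Pulse": [100, 105, 110, 108],
    "Calories": [240, 360, 480, 720],
})

# Correlation matrix: one row and one column per numeric column
corr_matrix = df.corr()

print(corr_matrix)
```

The result is a square table. Each column correlates perfectly with itself, so the diagonal is 1, and the table is symmetric because the correlation of A with B equals that of B with A.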





# SUMMARY
**Pandas for Data Science Professionals**

Pandas offers several advantages for data science professionals compared to traditional Python data structures like lists or dictionaries:

1. **Data Manipulation:** Pandas provides powerful tools for data manipulation, allowing you to easily filter, sort, group, and transform data.
2. **Data Cleaning:** Pandas simplifies data cleaning tasks, such as handling missing values, removing duplicates, and converting data types.
3. **Data Analysis:** Pandas offers a wide range of data analysis functions, including statistical analysis, correlation analysis, and data visualization.
4. **Efficiency:** Pandas is built for performance, especially when working with large datasets, thanks to its optimized data structures and algorithms.
5. **Data Integration:** Pandas seamlessly integrates with other data science libraries, such as scikit-learn for machine learning.
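
The data manipulation point can be made concrete with a short sketch of filtering, sorting, grouping, and transforming. The table is a made-up variation of the Name/Age/City example earlier in the notebook, with cities repeated so that grouping has something to aggregate.

```python
import pandas as pd

# Hypothetical data: two people per city so that groupby is meaningful
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "City": ["New York", "London", "New York", "London"],
    "Age": [25, 30, 22, 28],
})

# Filter: keep rows where Age is greater than 24
over_24 = df[df["Age"] > 24]

# Sort: order rows from oldest to youngest
by_age = df.sort_values("Age", ascending=False)

# Group: average age per city
mean_age_by_city = df.groupby("City")["Age"].mean()

# Transform: derive a new column from an existing one
df["AgeInMonths"] = df["Age"] * 12

print(mean_age_by_city)
```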

**Pandas is widely used in various domains. Here are some real-world examples of how Pandas is applied in data cleaning and exploratory data analysis (EDA):**
1. **Data Cleaning in Financial Analysis**
* **Scenario:** A financial analyst receives a dataset of stock prices with missing values, incorrect data types, and outliers.
* **Pandas Solution:** They can use Pandas to:
 * **Identify and handle missing values:** Use df.isnull().sum() to detect missing values, and methods like df.fillna() to fill them with appropriate values (e.g., mean, median).
 * **Correct data types:** Use df.astype() to convert columns to the correct data type (e.g., convert a price column from string to float).
 * **Remove or adjust outliers:** Use visualization techniques like box plots to identify outliers, and then filter or transform the data accordingly.
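
A minimal sketch of the financial scenario above, under stated assumptions: the price and volume values are invented, and outliers are flagged with the 1.5 × IQR rule (the same rule a box plot uses for its whiskers) rather than by inspecting a plot.

```python
import pandas as pd

# Hypothetical stock data: prices stored as strings, one missing volume,
# and one price that is clearly out of range
prices = pd.DataFrame({
    "price": ["10.5", "11.0", "12.5", "11.5", "999.0"],
    "volume": [100.0, None, 300.0, 200.0, 150.0],
})

# Count missing values per column
missing_per_column = prices.isnull().sum()

# Correct the data type: string -> float
prices["price"] = prices["price"].astype(float)

# Fill the missing volume with the column median
prices["volume"] = prices["volume"].fillna(prices["volume"].median())

# Flag outliers with the 1.5 * IQR rule and keep only the remaining rows
q1 = prices["price"].quantile(0.25)
q3 = prices["price"].quantile(0.75)
iqr = q3 - q1
within_range = prices["price"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
no_outliers = prices[within_range]

print(no_outliers)
```
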
2. **EDA in Customer Analytics**
* **Scenario:** A marketing team wants to understand customer behavior from a dataset of customer purchases, demographics, and website interactions.
* **Pandas Solution:**
 * **Data exploration:** Use df.describe() to get summary statistics, df.groupby() to analyze data by customer segments, and df.corr() to find correlations between variables (e.g., purchase frequency and age).
 * **Data visualization:** Create histograms, scatter plots, and other visualizations using Pandas plotting functions or integrate with libraries like Matplotlib and Seaborn to gain insights from the data.
 * **Feature engineering:** Create new features from existing ones, such as calculating customer lifetime value or segmenting customers based on purchase behavior.
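
The customer analytics steps above might look like the following sketch. The customer table, the segment labels, and the "frequent buyer" threshold of 5 purchases are all assumptions invented for this example.

```python
import pandas as pd

# Hypothetical customer data
customers = pd.DataFrame({
    "segment": ["new", "new", "loyal", "loyal"],
    "age": [22, 35, 41, 58],
    "purchases": [1, 2, 6, 9],
})

# Data exploration: summary statistics for the numeric columns
summary = customers.describe()

# Average number of purchases per customer segment
mean_by_segment = customers.groupby("segment")["purchases"].mean()

# Correlation between the numeric variables
correlations = customers[["age", "purchases"]].corr()

# Feature engineering: flag customers with 5 or more purchases (assumed threshold)
customers["frequent_buyer"] = customers["purchases"] >= 5

print(mean_by_segment)
print(correlations)
```
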
3. **Data Cleaning in Healthcare**
* **Scenario:** A healthcare researcher has a dataset of patient records with inconsistencies in data entry, missing values, and duplicate entries.
* **Pandas Solution:**
 * **Standardize data:** Use string functions (df['name'].str.lower()) to clean and standardize text data (e.g., patient names, diagnoses).
 * **Identify and remove duplicates:** Use df.duplicated() and df.drop_duplicates() to find and remove duplicate patient records.
 * **Handle missing data:** Use df.fillna() (for example with a mean, median, or fixed placeholder) or df.interpolate() to fill missing values, depending on the nature of the data.

These examples illustrate how Pandas simplifies common data cleaning and EDA tasks, enabling data professionals to prepare and analyze data effectively.
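
As one illustration of the healthcare scenario above, the sketch below standardizes text, removes duplicates, and fills a missing value. The patient records are invented, and filling a missing diagnosis with the placeholder "unknown" is an assumption made for the example, not a recommended imputation strategy.

```python
import pandas as pd

# Hypothetical patient records with inconsistent casing, stray whitespace,
# a duplicate entry, and a missing diagnosis
patients = pd.DataFrame({
    "name": ["Alice Smith", "alice smith ", "BOB JONES", "Carol White"],
    "diagnosis": ["Flu", "flu", "Asthma", None],
})

# Standardize text: trim whitespace and lowercase
patients["name"] = patients["name"].str.strip().str.lower()
patients["diagnosis"] = patients["diagnosis"].str.lower()

# Identify duplicates (True marks a repeat of an earlier row)
duplicate_mask = patients.duplicated()

# Remove duplicates, then fill the missing diagnosis with a placeholder
deduplicated = patients.drop_duplicates()
deduplicated = deduplicated.fillna({"diagnosis": "unknown"})

print(deduplicated)
```

Standardizing before deduplicating matters here: the two "Alice Smith" rows only count as duplicates once their casing and whitespace match.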

# CONCLUSION

* This notebook provided a comprehensive overview of the Pandas library in Python, covering its core functionalities such as Series, DataFrames, data cleaning, and data manipulation.

* Pandas is an indispensable tool for data analysis and manipulation, offering a wide range of functions to efficiently handle and analyze data. By leveraging Pandas, you can streamline your data workflows, gain deeper insights from your data, and make more informed decisions.

* As you continue your data science journey, mastering Pandas will be crucial for tackling complex data challenges and extracting meaningful information from your datasets.