# Pandas Library

In this section we will learn how to use pandas for data analysis. You can think of pandas as an extremely powerful version of Excel, with a lot more features.

Outline:

- Pandas Info
- Installing pandas Library
- Series objects
- DataFrame



## Pandas Info

**What is Pandas?**

- Pandas is a Python library that provides fast, flexible, and expressive data structures designed to make working with data sets easy and intuitive.

- It has functions for analyzing, cleaning, exploring, and manipulating data.

- The name "Pandas" refers to both "Panel Data" and "Python Data Analysis". The library was created by Wes McKinney in 2008.

**Why Use Pandas?**
- Pandas lets us analyze large data sets and draw conclusions based on statistical theory.

- Pandas can clean messy data sets, and make them readable and relevant.

- Relevant data is very important in data science.

- Pandas provides easy-to-use data structures and data analysis tools.

- The main data structure is the `DataFrame`, which you can think of as an in-memory 2D table (like a spreadsheet, with column names and row labels).
- For more information, see the installation guide: https://pandas.pydata.org/docs/getting_started/install.html

**What Can Pandas Do?**

**Pandas gives you answers about the data. Like:**

- Is there a correlation between two or more columns?
- What is the average value?
- What is the maximum value?
- What is the minimum value?
- Pandas can also delete rows that are not relevant or that contain wrong values, such as empty or NULL values. This is called cleaning the data.
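
The questions above map onto a handful of pandas calls. The following is a minimal sketch on made-up data (the `weather` table and its column names are assumptions for illustration, not part of the course data set):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: one row has a missing temperature.
weather = pd.DataFrame({
    "temp": [20.0, 25.0, np.nan, 30.0],
    "rentals": [100, 150, 120, 200],
})

average = weather["rentals"].mean()   # average value -> 142.5
highest = weather["rentals"].max()    # max value -> 200
lowest = weather["rentals"].min()     # min value -> 100

# Correlation between two columns (rows with a missing value are skipped).
correlation = weather["temp"].corr(weather["rentals"])

# Cleaning: drop every row that contains a missing value.
cleaned = weather.dropna()

print(average, highest, lowest, correlation, len(cleaned))
```

Note that `dropna()` returns a new DataFrame and leaves `weather` unchanged.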

## Installing Pandas Library

In [None]:
# Uncomment the next line to install pandas from within the notebook
# !pip install pandas

# Import pandas and NumPy under their conventional aliases
import pandas as pd
import numpy as np

## `Series` objects
The first pandas data structure we will look at is the `Series`:
* A `Series` object is a 1D array, similar to a column in a spreadsheet (with a column name and row labels).
* It is a one-dimensional labeled array capable of holding data of any type (integer, string, float, Python objects, etc.).
* A Series is very similar to a NumPy array (in fact it is built on top of the NumPy array object). What differentiates a Series from a NumPy array is that a Series can have axis labels, meaning it can be indexed by a label instead of just a number location. It also doesn't need to hold numeric data; it can hold any arbitrary Python object.


<img src=https://media.geeksforgeeks.org/wp-content/uploads/dataSER-1.png width=600>


In [None]:
s = pd.Series([2,-1,3,5])
print(type(s))
print(s.dtype)
print(s)

Arithmetic operations on Series are also possible, and they apply elementwise, just like for ndarrays:

In [None]:
s + [1000,2000,3000,4000]

Similar to NumPy, if you add a single number to a Series, that number is added to all items in the Series. This is called *broadcasting*:

In [None]:
s + 1000

The same is true for all binary operations such as `*` or `/`, and even conditional operations:

In [None]:
s < 0

In [None]:
s

In [None]:
s[s<0].index

## Index labels
Each item in a `Series` object has a unique identifier called the *index label*. By default, it is simply the position of the item in the `Series` (starting at `0`), but you can also set the index labels manually:

In [None]:
# simple array
data = np.array(['Ali','Mahmoud','Essa','Sameer','Samai'])
ser = pd.Series(data, index= [5,10,15,20,25], name = "N")
print(ser)


In [None]:

data=['Ali','Mahmoud','Essa','Sameer','Samai']
ser = pd.Series(data)

print(ser)


In [None]:

data=['Ali','Mahmoud','Essa','Sameer','Samai']
ser = pd.Series(data,index = np.arange(0,5))

print(ser)
print(type(ser))
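
Once a Series has custom index labels, items can be looked up either by label or by position. A minimal sketch, assuming a shortened version of the names used above:

```python
import pandas as pd

names = pd.Series(["Ali", "Mahmoud", "Essa"], index=[5, 10, 15], name="N")

by_label = names.loc[10]      # look up by index label
by_position = names.iloc[0]   # look up by integer position

print(by_label)      # Mahmoud
print(by_position)   # Ali
```

Using `loc` and `iloc` explicitly avoids ambiguity when the labels are themselves integers, as they are here.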

# DataFrame

## Creating DataFrame
- A pandas DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. It is generally the most commonly used pandas object.
- A DataFrame can be created in multiple ways. Let's go through them one by one.
- For more information, see this video: https://www.youtube.com/watch?v=dEHJmn6p39M&t=93s
<img src=https://media.geeksforgeeks.org/wp-content/cdn-uploads/creating_dataframe1.png width=800>


### Method #1: Creating Pandas DataFrame from a list of lists

In [None]:
# Import pandas library
import pandas as pd

# initialize list of lists
data = [['tom', 10], ['nick', 15], ['juli', 14]]

# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['Name', 'Age'])

# print dataframe.
df


### Method #2: Creating DataFrame from dict of ndarrays/lists
To create a DataFrame from a dict of ndarrays/lists, all the arrays must be of the same length. If an index is passed, its length must equal the length of the arrays. If no index is passed, the index defaults to `range(n)`, where `n` is the array length.

In [None]:
# Create a DataFrame from a dict of ndarrays/lists,
# using the default integer index.

import pandas as pd

# initialize data of lists.
data = {'Name':['Tom', 'nick', 'krish', 'jack'],
        'Age':[20, 21, 19, 18]}

# Create DataFrame
df = pd.DataFrame(data)

# Print the output.
df
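
The length rule described above can be checked directly. The sketch below uses deliberately mismatched lists (made up for illustration) and shows that pandas refuses to build the DataFrame, while equal-length lists get the default `range(n)` index:

```python
import pandas as pd

# "Age" has one value fewer than "Name", so construction should fail.
bad_data = {"Name": ["Tom", "nick", "krish", "jack"], "Age": [20, 21, 19]}

raised = False
try:
    pd.DataFrame(bad_data)
except ValueError as err:
    raised = True
    print("Could not build the DataFrame:", err)

# With equal-length lists and no index, the index defaults to 0..n-1.
good = pd.DataFrame({"Name": ["Tom", "nick"], "Age": [20, 21]})
print(list(good.index))
```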


### Method #3: Creating an indexed DataFrame using arrays

In [None]:
# Create a pandas DataFrame from a dict of lists,
# with explicit index labels.
import pandas as pd

# initialize data of lists.
data = {'Name':['Tom', 'Jack', 'nick', 'juli'],
        'marks':[99, 98, 95, 90, ]}

# Creates pandas DataFrame.
df = pd.DataFrame(data, index =['rank1',
                                'rank2',
                                'rank3',
                                'rank4',
                                ])

# print the data
df


### Method #4: Creating Dataframe from list of dicts
A pandas DataFrame can be created by passing a list of dictionaries as input data. By default, the dictionary keys are taken as column names; a key missing from one of the dictionaries produces `NaN` in that row.

In [None]:
# Create a pandas DataFrame from a list of dicts.
import pandas as pd

# Initialize the data as a list of dicts.
data = [{'a': 1, 'b': 2, 'c':3, "d" :10},
        {'a':10, 'b': 20, 'c': 30}]

# Creates DataFrame.
df = pd.DataFrame(data)

# Print the data
df


In [None]:
data[0].keys()

### Method #5: Creating Dataframe from Series
You can create a DataFrame by passing a dictionary of `Series` objects:

In [None]:
people_dict = {
    "weight": pd.Series([68, 83, 112], index=["alice", "bob", "charles"]),
    "birthyear": pd.Series([1984, 1985, 1992], index=["bob", "alice", "charles"]),
    "children": pd.Series([0, 3], index=["charles", "bob"]),
    "hobby": pd.Series(["Biking", "Dancing"], index=["alice", "bob"]),
}
people = pd.DataFrame(people_dict)
people

In [None]:
dict_1={"car":["city","ignis","800","Verna","Venue","Punto"],"brand":["Honda","maruti","Maruti","Hyundia","Hyundai","Fiat"],
         "cost":[900000,600000,100000,800000,950000,750000],"year":[5,7,10,4,2,6]}

In [None]:
auto=pd.DataFrame(dict_1)
auto

In [None]:
auto.to_csv('automobile.csv')
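
By default `to_csv` also writes the row index as an unnamed first column, which then shows up as an extra column when the file is read back. The sketch below illustrates this with an in-memory buffer instead of a file (the small `cars` table is a made-up stand-in for `auto`):

```python
import io

import pandas as pd

cars = pd.DataFrame({"car": ["city", "ignis"], "cost": [900000, 600000]})

# Default behaviour: the index is written as an extra, unnamed column.
with_index = io.StringIO()
cars.to_csv(with_index)
with_index.seek(0)
columns_with_index = list(pd.read_csv(with_index).columns)

# index=False writes only the data columns.
without_index = io.StringIO()
cars.to_csv(without_index, index=False)
without_index.seek(0)
restored = pd.read_csv(without_index)

print(columns_with_index)
print(list(restored.columns))
```

Whether to keep the index depends on whether it carries information; for a default `0..n-1` index, `index=False` usually gives a cleaner file.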

### Dealing with Dataframe

In [None]:
import pandas as pd

In [None]:
data=pd.read_csv("bike_rentals(notebook).csv")

In [None]:
data.head(2)

In [None]:
data.info()

In [None]:
print("# of rows", len(data))
print("# of columns", len(data.columns))


In [None]:
data.shape

In [None]:
data.columns

In [None]:
data.describe()

In [None]:
data.head(1)

In [None]:
data['season'].value_counts()


In [None]:
print("# of unique dayname",data["dayname"].nunique())
print(data["dayname"].unique())
print(data["dayname"].value_counts())

In [None]:
data["dayname"].unique()

In [None]:
data.isnull().sum()

In [None]:
print(data.columns[data.isnull().any()].values)

In [None]:
data.isnull().sum(axis=1)

In [None]:
a = [[100, 2, 3], [10, 9, 8]]
a.sort(key=lambda x: x[1])
print(a)

### Accessing Columns

In [None]:
data['dayname']

In [None]:
data[['weathersit', 'temp']]

### Accessing rows
Let's go back to the `people` `DataFrame`:

In [None]:
people = pd.DataFrame({
    "birthyear": {"alice":1985, "bob": 1984, "charles": 1992},
    "hobby": {"alice":"Biking", "bob": "Dancing"},
    "weight": {"alice":68, "bob": 83, "charles": 112},
    "children": {"bob": 3, "charles": 0}
})
people

The `loc` attribute lets you access rows instead of columns. The result is a `Series` object in which the `DataFrame`'s column names are mapped to row index labels:

In [None]:
people.loc["charles"]

You can also access rows by integer location using the `iloc` attribute:

In [None]:
people.iloc[2]

You can also get a slice of rows, and this returns a DataFrame object:

In [None]:
people.iloc[1:3]

Finally, you can pass a boolean array to get the matching rows:

In [None]:
people[np.array([True, False, True])]

This is most useful when combined with boolean expressions:

In [None]:
people["birthyear"] < 1990

In [None]:
people[people["birthyear"] < 1990]
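
Boolean expressions can also be combined with `&` (and) and `|` (or). A minimal sketch, assuming a reduced stand-in for the `people` DataFrame (the weight threshold of 70 is made up for illustration):

```python
import pandas as pd

people_small = pd.DataFrame(
    {"birthyear": [1985, 1984, 1992], "weight": [68, 83, 112]},
    index=["alice", "bob", "charles"],
)

born_before_1990 = people_small["birthyear"] < 1990
heavier_than_70 = people_small["weight"] > 70

# Rows where both conditions hold.
both = people_small[born_before_1990 & heavier_than_70]

# Rows where at least one condition holds.
either = people_small[born_before_1990 | heavier_than_70]

print(both)
print(either)
```

When the conditions are written inline, each one needs its own parentheses, for example `people_small[(people_small["birthyear"] < 1990) & (people_small["weight"] > 70)]`, because `&` binds more tightly than `<` and `>`.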

### Adding and removing columns
You can generally treat DataFrame objects like dictionaries of Series, so the following work fine:

In [None]:
people

In [None]:
import datetime

new_df = people.copy()
new_df["age"] = (
    datetime.datetime.now().year - new_df["birthyear"]
)  # adds a new column "age"

# people["age"]
new_df["over 35"] = new_df["age"] > 35  # adds another column "over 35"

birthyears = new_df.pop("birthyear")
new_df.drop("weight", axis=1, inplace=True)
del new_df["children"]

new_df

In [None]:
birthyears

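When you add a new column as a `Series`, pandas aligns it with the DataFrame's index labels rather than requiring the same number of rows. Rows that are missing from the new column are filled with NaN, and labels that do not exist in the DataFrame are ignored: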

In [None]:
people["pets"] = pd.Series({"bob": 0, "charles": 5, "eugene":1})  # alice is missing, eugene is ignored
people

When adding a new column, it is added at the end (on the right) by default. You can also insert a column anywhere else using the insert() method:

In [None]:
people.insert(1, "height", [172, 181, 185])
people

In [None]:
pd.Series([1, 2, 3, 4], name="numbers", index=["a", "b", "c", "d"])

In [None]:
import string

a = pd.Series(list(string.ascii_uppercase), name="values")
df = pd.DataFrame(a)
df["ord"] = df["values"].apply(ord)
df

### Changing index

In [None]:
data=data.set_index("season")

In [None]:
data.head(5)

In [None]:
data.reset_index(inplace = True)

In [None]:
data.head()

### Filtering of data

In [None]:
new_data = data["dayname"] == "Saturday"
new_data

In [None]:
data.loc[new_data]

In [None]:
auto

In [None]:
isin_1 = auto['car'].isin(['800','city'])

In [None]:
auto.loc[isin_1]

In [None]:
start=auto['brand'].str.startswith('H')
auto.loc[start]

In [None]:
ends = auto['brand'].str.endswith('i')
auto.loc[ends]

In [None]:
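not_contains = ~auto['brand'].str.contains('o')  # brands that do NOT contain 'o'
auto.loc[not_contains]

In [None]:
auto.loc[~not_contains]  # invert the mask with ~ to get the brands that contain 'o'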

In [None]:
isna = auto['car'].isna()
auto.loc[isna]

In [None]:
notna = auto['car'].notna()
auto.loc[notna]

### Sorting of Data

In [None]:
auto.sort_values("cost")

In [None]:
from datetime import datetime


class Person:
    def __init__(self, name: str, birthdate: datetime) -> None:
        self.name = name
        self.birthdate = birthdate

    def cal(self):
        next_birthday = datetime(
            datetime.now().year, self.birthdate.month, self.birthdate.day
        )
        if datetime.now() > next_birthday:
            next_birthday = datetime(
                datetime.now().year + 1, self.birthdate.month, self.birthdate.day
            )
        print(next_birthday)
        print(datetime.now())
        return (next_birthday - datetime.now()).days


mohammad = Person("Mohammad Mrayyan", datetime(2001, 2, 7))
print(mohammad.cal())

In [None]:
data.sort_values(["dayname", "season"], ascending=[True, False])

### Updating Columns and Rows

*Changing the name of columns*

In [None]:
auto.columns= pd.Series(auto.columns).apply(str.capitalize)
auto

In [None]:
auto.columns=[x.lower() for x in auto.columns]

In [None]:
auto

In [None]:
auto.rename(columns={"brand":"company","cost":"price"},inplace=True)
auto

In [None]:
auto

*Updating Columns*

In [None]:
auto["price"]= auto["price"].replace({100000:123})

In [None]:
auto

In [None]:
auto["car"].apply(str.upper)

In [None]:
def price_change(price):
    return price/10

In [None]:
auto["price"].apply(price_change)

In [None]:
auto["price"].apply(lambda x:x*10)

In [None]:
auto["car"].replace({"800":"eight hundred","ignis":"swift"})

*Updating Rows*

In [None]:
auto.loc[2,["car","price"]]=["dezire",600000]

In [None]:
auto

In [None]:
auto["company"].replace({"Honda":"H"})

*Adding and removing a column*

In [None]:
fuel_type=["Diesel","Petrol","Petrol","Diesel","Petrol","Diesel"]

In [None]:
auto["Fuel Types"]=fuel_type

In [None]:
auto

In [None]:
auto.drop("Fuel Types",axis=1,inplace=True)

In [None]:
auto

*Adding and Removing Rows*

In [None]:
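# DataFrame.append was removed in pandas 2.0; pd.concat adds the new row instead
new_row = pd.DataFrame([{"car": "Tiago", "company": "TATA"}])
auto = pd.concat([auto, new_row], ignore_index=True)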

In [None]:
auto.drop(index=6)

### Grouping And Aggregation

In [None]:
groups = data.groupby("season")  # group the rows by season
groups.describe()

In [None]:
groups.mean(numeric_only=True)  # average only the numeric columns

In [None]:
groups.size()

In [None]:
groups.count()
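
The difference between `size()` and `count()` is easy to miss: `size()` counts all rows in each group, while `count()` counts only the non-missing values. A minimal sketch on made-up data (the `rentals` table and its columns are assumptions, not the bike rentals file):

```python
import numpy as np
import pandas as pd

rentals = pd.DataFrame({
    "season": ["winter", "winter", "summer", "summer", "summer"],
    "rented": [10, 20, 30, np.nan, 50],
})

by_season = rentals.groupby("season")

sizes = by_season.size()                 # rows per group, NaN included
counts = by_season["rented"].count()     # non-missing values per group
means = by_season["rented"].mean()       # NaN is skipped when averaging

print(sizes)
print(counts)
print(means)
```

Here the summer group has three rows but only two recorded values, so `size()` and `count()` disagree for that group.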

### Cleaning of Data

In [None]:
import numpy as np

dict_1={"car":["city","ignis",np.nan,"Verna","Venue","Punto",np.nan],"brand":["Honda","maruti","Maruti","Hyundia",np.nan,"Fiat",np.nan],
         "cost":[900000,600000,np.nan,800000,950000,750000,np.nan,],"year":[5,7,10,np.nan,np.nan,6,np.nan]}


auto=pd.DataFrame(dict_1)
auto

In [None]:
auto.isnull().sum()

In [None]:
auto.dropna(axis="index",how='all',subset=["cost"])

In [None]:
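# Assign the result back; calling fillna(inplace=True) on a selected column may not update the DataFrame
auto["cost"] = auto["cost"].fillna(0)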

In [None]:
auto
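
Different columns often need different replacement values. The sketch below first drops rows that are entirely empty and then fills each column with its own default; the `cars` table and the chosen fill values are assumptions for illustration:

```python
import numpy as np
import pandas as pd

cars = pd.DataFrame({
    "car": ["city", np.nan, np.nan],
    "cost": [900000, np.nan, np.nan],
    "year": [5, 10, np.nan],
})

# Drop rows in which every value is missing (the last row here).
cars = cars.dropna(how="all")

# Fill the remaining gaps column by column:
# a placeholder for the name, the column mean for the cost.
filled = cars.fillna({"car": "unknown", "cost": cars["cost"].mean()})

print(filled)
```

Filling with the mean is only one possible choice; depending on the data, a median, a constant, or dropping the row may be more appropriate.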

### `concat()` function in Pandas DataFrame

In [None]:
# importing the module
import pandas as pd

# creating the DataFrames
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
					'B': ['B0', 'B1', 'B2', 'B3']})
display('df1:', df1)
df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
					'B': ['B4', 'B5', 'B6', 'B7']})
display('df2:', df2)

# concatenating
print('After concatenating:')
display(pd.concat([df1, df2], keys = ["Key1", "Key2"]))


In [None]:
# importing the module
import pandas as pd

# creating the DataFrames
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
					'B': ['B0', 'B1', 'B2', 'B3']})
display('df1:', df1)
df2 = pd.DataFrame({'C': ['C0', 'C1', 'C2', 'C3'],
					'D': ['D0', 'D1', 'D2', 'D3']})
display('df2:', df2)

# concatenating
display('After concatenating:')
display(pd.concat([df1, df2],
				axis = 1))


In [None]:
import pandas as pd

# creating the DataFrames
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
					'B': ['B0', 'B1', 'B2', 'B3']})
display('df1:', df1)
df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
					'B': ['B4', 'B5', 'B6', 'B7']})
display('df2:', df2)

# concatenating
display('After concatenating:')
display(pd.concat([df1, df2],
				ignore_index = True))


In [None]:
# importing the module
import pandas as pd

# creating the DataFrame
df = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
					'B': ['B0', 'B1', 'B2', 'B3']})
display('df:', df)
# creating the Series
series = pd.Series([1, 2, 3, 4],name = "C")
display('series:', series)

# concatenating
display('After concatenating:')
display(pd.concat([df, series],
				axis = 1))


In [None]:
import pandas as pd
import timeit

# Method 1: Using iterrows()
def method_with_iterrows(df):
    for index, row in df.iterrows():
        df.at[index, 'new_column'] = row['old_column'].upper()
    return df

# Method 2: Without iterrows() (using vectorized operations)
def method_without_iterrows(df):
    df['new_column'] = df['old_column'].str.upper()
    return df

# Generate sample data
data = {'old_column': ['apple', 'banana', 'orange'] * 10**5}
df = pd.DataFrame(data)

# Measure time for method with iterrows()
time_with_iterrows = timeit.timeit("method_with_iterrows(df.copy())", globals=globals(), number=1)

# Measure time for method without iterrows()
time_without_iterrows = timeit.timeit("method_without_iterrows(df.copy())", globals=globals(), number=1)

# Display results
print(f"Method with iterrows() execution time: {time_with_iterrows:.5f} seconds")
print(f"Method without iterrows() execution time: {time_without_iterrows:.5f} seconds")


In [None]:
import timeit
def test():
    [i for i in range(10**7)]  # build a list of 10**7 integers in pure Python
def test2():
    np.arange(10**7)  # build the same range as a NumPy array
t = timeit.timeit("test()",globals=globals(), number=1)
t2 = timeit.timeit("test2()",globals=globals(), number=1)
print(t)
print(t2)