# Groups of Functions in Pandas for Data Analysis

### A. Creating Series and DataFrames

* You should already be familiar with Python's list and dictionary data structures. Now we will use both lists and dictionaries to create pandas Series and DataFrames.


**Creating a Pandas Series**

To do anything with pandas, the first step is to import the library. By convention it is imported under the alias `pd`.

* importing pandas package
```c
import pandas as pd
```

* Creating pandas series
```c
series = pd.Series(data)
```

* Creating pandas DataFrame
```c
dataframe = pd.DataFrame(data)
```


In [None]:
# Lets create a pandas Series using a python list

#Step 1: Import pandas package

import pandas as pd

#Step2: Define a list
data = [1,2,3,4,5,6,7,8,9,10]


#Step3: Create the series
series = pd.Series(data)

# lets view the series that we have created
series.head(10)



In [None]:
# lets confirm to be sure we had created a pandas series
type(series)

In [None]:
# Let's create a Series from the same list, but this time with our own row labels. In pandas these labels are called the index
series2 = pd.Series(data, index = ["a","b","c","d","e","f","g","h","i","j"])
series2.head(10)

In [None]:
# Lets create a series using python dictionary


#lets create a python dictionary
data2 = {'a': 10, 'b': 20, 'c': 30}

# lets create the series
series3 = pd.Series(data2)
series3.head()

__________________________________________________
**Hands on practice**:
1. Create a bucket list of 6 items. Convert the list to a pandas Series and give it an index made of letters.
2. Create a simple Python dictionary of your biodata with 5 keys and their corresponding values. Convert the dictionary into a pandas Series.

_Depending on where you are viewing this notebook, either download it or make a copy._
________________________________________________

**Creating a DataFrame**


* Import the pandas package
```c
import pandas as pd
```
* Create your list of lists or dictionary
```c
data = []
#or
data = {}
```
* Create the dataframe using this syntax
```c
df = pd.DataFrame(data)
```

In [None]:

# Let's create a dataframe
# Step 1: Import pandas

import pandas as pd

# Step 2: Define the data as a dictionary whose values are lists

data = {
    'Name': ['Chris', 'Ayo', 'Chisom'],
    'Age': [26, 24, 22],
    'Home_Town': ['Benin', 'Ibadan', 'Enugu']
}

# Lets create the dataframe using "df" as short for dataframe
df = pd.DataFrame(data)
df.head()

In [None]:
# Let's do the same thing using a list of dictionaries
data2 = [
    {'Name': 'Chris', 'Age': 26, 'Home_Town': 'Benin'},
    {'Name': 'Ayo', 'Age': 24, 'Home_Town': 'Ibadan'},
    {'Name': 'Chisom', 'Age': 22, 'Home_Town': 'Enugu'}
]
# Let's define the dataframe
df2 = pd.DataFrame(data2)
df2.head()

In [None]:
# Let's do the same thing again using a list of lists

data3 = [
    ['Chris', 26, 'Benin'],
    ['Ayo', 24, 'Ibadan'],
    ['Chisom', 22, 'Enugu']
]
df3 = pd.DataFrame(data3, columns=['Name', 'Age', 'Home_Town'])
df3.head()

In [None]:
# lets print the types to be sure we have defined dataframes
print(type(df))
print(type(df2))
print(type(df3))

**Hands on practice**

* **Creating a dataset:**

Let's create a Google Sheet and make the link accessible so that everyone can enter the following information: First_Name, Last_Name, Gender, Seat_No, City, Course_Track, PC_make, PC_Os, and Feedback.


[Click here to respond](https://forms.gle/8VQgWmvqQyiPEifY8)

At the end of the collection, we will use the data to practice data manipulation.

---



### B. Data Input and Output:

**To read in datasets we use**
```c
pd.read_csv() # for csv files
```

```c
pd.read_excel() # for excel files
```
**Note**: There are many other functions for reading different file types, for example `pd.read_json()`, `pd.read_html()` and `pd.read_sql()`. Plain .txt files can usually be read with `pd.read_csv()` by passing the right `sep`. If you are curious, you could check them out.


**To save to a CSV or Excel file**

To save to CSV
```c
df.to_csv()
```
To save to Excel
```c
df.to_excel()
```
Use case example
```c
bio_data.to_csv("bio_data.csv", index = False)
```

Here, we will download our collected data in CSV format and in Excel format, then load the CSV file using `pd.read_csv()`.

Then we will inspect and explore the data.
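The save-and-load cycle described above can be sketched as a small round trip. This is only an illustration: `bio_sample` and the file name `bio_sample.csv` are made-up stand-ins for the class dataset, and the file is written to a temporary folder so nothing is left behind.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical sample data standing in for the class dataset
bio_sample = pd.DataFrame({
    "First_Name": ["Chris", "Ayo", "Chisom"],
    "Seat_No": [101, 102, 103],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "bio_sample.csv"
    # index=False leaves the row labels out of the file
    bio_sample.to_csv(csv_path, index=False)
    # Read the file back into a new DataFrame
    loaded = pd.read_csv(csv_path)

print(loaded.head())
```

Without `index=False`, the row labels would be written as an extra unnamed column and would show up again when the file is read back.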

In [None]:
# # Lets get to work...
# df = pd.read_csv('bio_data.csv')
# df.head()

# # Ensure to code along...

### C. Data Inspection and Exploration

To inspect our dataset we will be using the following pandas methods and attributes
```c
.head() # To view the first 5 rows
```

```c
.tail() # To view the last 5 rows
```

```c
.info() # To check the information about the data
```

```c
.describe() # statistical summary
```

```c
.shape # Check the dimension of the dataset
```

```c
.columns # for checking the column names
```
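As a quick sketch of the inspection tools listed above, here they are applied to a small made-up DataFrame (`sample` and its values are assumptions used only for illustration, not the class dataset).

```python
import pandas as pd

# Hypothetical sample data, used only for illustration
sample = pd.DataFrame({
    "Name": ["Chris", "Ayo", "Chisom", "Ada", "Kunle", "Mercy"],
    "Age": [26, 24, 22, 25, 23, 27],
})

first_two = sample.head(2)           # first 2 rows instead of the default 5
last_two = sample.tail(2)            # last 2 rows
n_rows, n_cols = sample.shape        # shape is an attribute, so no brackets
column_names = list(sample.columns)  # columns is also an attribute
summary = sample.describe()          # statistics for the numeric columns

sample.info()                        # prints column names, non-null counts and dtypes
print(n_rows, n_cols)
print(column_names)
print(summary)
```

Note that `.shape` and `.columns` are attributes, while `.head()`, `.tail()`, `.info()` and `.describe()` are methods and need brackets.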

In [None]:
# Let's use all of these methods on our data



# Please, ensure to code along

### D. Data Cleaning

 Data cleaning involves identifying and handling errors or inconsistencies in your dataset. Later in this course, data cleaning will be covered in detail.

Handling Missing Values

```c
.isna() # Check for missing values; .isnull() is an alias that does the same thing
```

```c
.isna().sum()  # Count the missing values in each column
```

```c
.fillna() # Fill up missing values
```

```c
.dropna() # Drop missing values
```
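A minimal sketch of these four methods on a small made-up DataFrame follows. The data and the fill values (`"Unknown"` and `0`) are assumptions chosen for illustration; the right fill value always depends on your dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing value in each column
sample = pd.DataFrame({
    "City": ["Lagos", None, "Abuja", "Kano"],
    "Seat_No": [101.0, 102.0, np.nan, 104.0],
})

missing_per_column = sample.isna().sum()   # missing values in each column
filled = sample.fillna({"City": "Unknown", "Seat_No": 0})  # one fill value per column
dropped = sample.dropna()                  # keep only rows with no missing values

print(missing_per_column)
print(filled)
print(dropped)
```

Both `fillna()` and `dropna()` return a new DataFrame here, so `sample` itself still contains the missing values.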

Finding and Handling Duplicates

Duplicates are repeated rows. By default, `df.duplicated()` compares whole rows.

```c
df.duplicated() # This returns True for every row that repeats an earlier row
```

```c
df.drop_duplicates() # This is used for dropping the duplicate rows
```

Correcting Data Types

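pandas has several data types (dtypes). The ones you will meet most often are `int64` (whole numbers), `float64` (decimal numbers), `bool`, `datetime64[ns]` and `object` (usually text).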

You can check the data type of each column using
```c
df.dtypes # an attribute, so no brackets
```

To convert a column to another type (type casting), you use

```c
df.astype() # this takes in the datatype you want to convert it to as an argument
```

When working with dates or time series data, it's important to convert the column to a pandas datetime type using

```c
pd.to_datetime() # takes in the data column as an argument
```
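The duplicate handling and type conversion steps above can be sketched together on a small made-up DataFrame. The column names and values here are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical sample data: the last row repeats the second one
sample = pd.DataFrame({
    "Name": ["Chris", "Ayo", "Ayo"],
    "Age": ["26", "24", "24"],  # numbers stored as text
    "Joined": ["2024-01-15", "2024-02-01", "2024-02-01"],
})

duplicate_mask = sample.duplicated()   # True for each repeat of an earlier row
deduped = sample.drop_duplicates()     # new DataFrame without the repeats

deduped = deduped.astype({"Age": "int64"})             # cast text to integers
deduped["Joined"] = pd.to_datetime(deduped["Joined"])  # parse text into datetimes

print(duplicate_mask)
print(deduped.dtypes)
```

Passing a dictionary to `astype()` converts only the named columns and leaves the others untouched.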


In [None]:
# Do we have any missing values? If yes, let's fill them up


bio_data.isna().sum()


# Ensure to code along

### E. Data Selection and Filtering

Viewing data column

In [None]:
#bio_data_column = [First_Name, Last_Name, Gender, Seat_No, City, Course_Track, PC_make, PC_Os, Feedback]

In [None]:
bio_data.columns

Column selection

In [None]:
# lets look through a single column
bio_data['First_Name']

# alternatively, we can use dot
bio_data.First_Name

In [None]:
# LEts select multiple columns
bio_data[['First_Name', 'Last_Name']]

In [None]:
# lets select more columns
bio_data[['First_Name', 'City', 'Feedback']]

Cell Selection

In [None]:
# lets select a single cell

bio_data['First_Name'][0] # This will return the first value of the "First_Name" column

# lets try other methods for selecting cells
bio_data.at[0, "First_Name"] # This will also return the first value of the "First_Name" column


# There is still another method using .iat[]
bio_data.iat[0, 0] # This will return the first value of the first column(row0,column0)


Row selection

`iloc` is used to select rows, columns, or both by integer position, using indexing and slicing.
This is especially useful when your dataset does not have meaningful labels (that is, row names and column names).

In [None]:
# Lets select some rows
bio_data.iloc[0:5] # we are selecting the rows at positions 0 to 4; the end of the slice (5) is not included

In [None]:
# combination of row and column selection
bio_data.iloc[0:5, 0:3] # the first slice picks the rows and the second slice picks the columns

___________________________________________________
**Hands on practice**

Find out when and how to use the `.loc` indexer, and apply it to the dataset.
_________________________________________________

Conditional Filtering

In [None]:
# Filter rows where Gender is 'Male'. This returns a DataFrame.
# (bio_data2 is the sample dataset created in section G)
filtered_male = bio_data2[bio_data2['Gender'] == 'Male']
print("Rows where Gender is 'Male':")
filtered_male


In [None]:
# Filter rows where City is 'Lagos' and Course_Track is 'Data Science'
filtered_city = bio_data2[(bio_data2['City'] == 'Lagos') & (bio_data2['Course_Track'] == 'Data Science')]
print("Rows where City is 'Lagos' and Course_Track is 'Data Science':")
filtered_city


In [None]:
# Filter rows where City is either 'Lagos' or 'Abuja'
cities = ['Lagos', 'Abuja']
city_filtered = bio_data2[bio_data2['City'].isin(cities)]
print("Rows where City is either 'Lagos' or 'Abuja':")
city_filtered


Using the .query() method

In [None]:
# Use query() to filter rows where Course_Track is 'AI' and Feedback is 'Excellent'
query_filtered = bio_data2.query("Course_Track == 'AI' and Feedback == 'Excellent'")
print("Rows filtered using query() method:")
query_filtered


In [None]:
# Filter rows where Course_Track is 'Data Science'
data_science = bio_data2.query("Course_Track == 'Data Science'")
print("Students in the Data Science track:")
data_science


In [None]:
# Filter rows using multiple conditions with logical operators
webdev_high_seat_No = bio_data2.query("Seat_No > 110 and Course_Track == 'Web Dev'")
print("Web Dev students with Seat_No greater than 110:")
webdev_high_seat_No


In [None]:
# Filter rows where PC_make is either 'HP' or 'Dell'
hp_dell = bio_data2.query("PC_make in ['HP', 'Dell']")
print("Rows where PC_make is either HP or Dell:")
hp_dell


Sometimes we may want to use a Python variable inside our query. It can be done by prefixing the variable with an @ symbol.

In [None]:
# Define a variable for the course track
desired_track = 'Cloud Computing'

# Use the variable in the query expression
cloud_computing_students = bio_data2.query("Course_Track == @desired_track")
print("Students in the Cloud Computing track:")
cloud_computing_students


Let's filter rows where the Feedback is not "Poor" and the City is "Lagos". Use the != operator for negation.

In [None]:
# Filter rows where Feedback is not 'Poor' and City is 'Lagos'
good_feedback_lagos = bio_data2.query("Feedback != 'Poor' and City == 'Lagos'")
print("Students in Lagos with Feedback other than 'Poor':")
good_feedback_lagos

In [None]:
# Let's create a more complex query filter using Course_Track, Feedback and Seat_No
complex_query = bio_data2.query("Course_Track == 'Data Science' or (Feedback == 'Excellent' and Seat_No < 115)")
print("Complex query result:")
complex_query


### F. Data Transformation

Renaming Column Name

In [None]:
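# Let's rename columns by passing a dictionary with the old names as keys and the new names as values
# rename() returns a new DataFrame; assign the result to a variable if you want to keep it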
bio_data.rename(columns={'First_Name': 'FirstName', 'Last_Name': 'LastName'})

# You can try renaming all the columns by removing all the underscores

Applying String Methods

`.str` is a string accessor used alongside the normal methods for manipulating strings, such as `.upper()`, `.lower()`, `.title()` etc.

Let's see how they are combined:

```c
.str.upper() # This converts to upper cases or capital letter
```

```c
.str.lower() # This converts to lower cases or small letters
```


```c
.str.title() # This converts to title cases or capitalize first letter of each word
```


```c
.str.strip() # This removes white space before and after a string
```


```c
.str.split() # Splits each string into a list of parts around a delimiter (whitespace by default), such as splitting a sentence into words.
```

```c
.str.len() # Returns the length of each string.
```


```c
.str.replace() # Similar to find and replace in Excel. It replaces matching text within each string.
```


```c
.str.contains() # Checks whether a substring is present in each string
```


```c
.str.join() # Joins the elements of each list in the column into a single string.
```

```c
.str.slice() # Slices each string between the given start and stop positions
```
All of these methods will come in very handy during data cleaning and data preprocessing for text data.
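As a rough sketch of how these accessors chain together, the example below cleans a small made-up Series of city names. The data is an assumption for illustration and is not taken from the class dataset.

```python
import pandas as pd

# Hypothetical messy text data
cities = pd.Series(["  lagos ", "PORT harcourt", "abuja"])

cleaned = cities.str.strip().str.title()   # trim spaces, then title-case
lengths = cleaned.str.len()                # number of characters in each string
has_port = cleaned.str.contains("Port")    # True where the substring appears
words = cleaned.str.split(" ")             # list of words for each string
joined = words.str.join("-")               # join each list back with a hyphen
replaced = cleaned.str.replace("Abuja", "FCT", regex=False)  # literal find and replace
first_three = cleaned.str.slice(0, 3)      # characters at positions 0, 1 and 2

print(cleaned)
print(joined)
```

Each `.str` call returns a new Series, which is why the calls can be chained one after another.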

Let's apply some of the string methods.

**Note**: To apply a string method to several columns or to the entire dataset at once, we pass a function (often a lambda function) to `.apply()`.

In [None]:
# Lets apply some of the methods to our dataset
bio_data["Feedback"] = bio_data["Feedback"].str.lower() # Here we are converting  everything to small letters.

In [None]:
# Let's convert the "PC_Os" column to upper case
bio_data["PC_Os"] = bio_data["PC_Os"].str.upper() # Here we are converting everything to capital letters.

In [None]:
# Lets convert the "First_Name"
bio_data["First_Name"] = bio_data["First_Name"].str.title() # Here we are converting the first letters to capital

In [None]:
#LEts view to see if it has applied
bio_data.head()

In [None]:
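# Let's define a lambda function; here x is a whole column (a Series)
# lambda x: x.str.title()

# The .apply() method on a DataFrame passes each selected column to the function
cols = ["First_Name", "Last_Name", "City"]  # example selection of text columns
bio_data[cols] = bio_data[cols].apply(lambda x: x.str.title())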

In [None]:
#LEts view to see if it has applied
bio_data.head()

In [None]:
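# We can also apply a lambda function to every element in the dataset.
# Each x is a single value here, so we call x.title() and leave non-text values as they are.
# Note: from pandas 2.1, applymap() is deprecated in favour of DataFrame.map()
bio_data = bio_data.applymap(lambda x: x.title() if isinstance(x, str) else x)
bio_data.head()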

**Assignment**

1. Research the `.str.normalize()` method. Write a 100-word summary of your findings.

2. Look for an accented Yoruba text dataset online, apply the `.str.normalize()` method to it and submit before the next class.
3. Try out the other string operations listed above for transforming text or string data.

Sorting Values

You can use `.sort_values()` to sort the dataframe by one or multiple columns.

Below is the general form of a `sort_values()` call, shown with its default arguments
```c
bio_data.sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last', ignore_index=False)

# axis=0 (the default) sorts the rows
# axis=1 sorts the columns
# inplace=True changes bio_data itself instead of returning a sorted copy
# na_position="first" puts the NaN values first; the default is "last"

```
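A small sketch of a few of these parameters in action is shown below. The DataFrame `sample` is made up for illustration, with one missing age so that the effect of `na_position` is visible.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing age
sample = pd.DataFrame({
    "Name": ["Chris", "Ayo", "Chisom", "Ada"],
    "Age": [26, np.nan, 22, 24],
})

# Sort rows by Age, oldest first, missing ages on top, and renumber the index
sorted_sample = sample.sort_values(
    by="Age", ascending=False, na_position="first", ignore_index=True
)

print(sorted_sample)
print(sample)  # unchanged, because inplace defaults to False
```

Since `inplace` is left at its default, the sorted result has to be assigned to a variable to be kept.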

In [None]:
# Sorting the rows by a single column
bio_data.sort_values(by='City', ascending=True)


In [None]:
# lets sort by multiple columns
bio_data.sort_values(by=['City', 'PC_make'], ascending=[True, False])

Sorting by row labels or index

To sort the dataframe by its index or row label the `.sort_index()` method is used.

How to use `.sort_index()` method

```c
bio_data.sort_index(axis = 0, level = None, ascending = False, inplace = False, sort_remaining = True)
```
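To make the behaviour of `.sort_index()` concrete, here is a minimal sketch on a made-up DataFrame whose row labels are deliberately out of order. The labels and scores are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data with row labels that are out of order
scores = pd.DataFrame(
    {"Score": [70, 85, 90]},
    index=["c", "a", "b"],
)

by_label = scores.sort_index()                      # row labels in order a, b, c
by_label_desc = scores.sort_index(ascending=False)  # row labels in order c, b, a
by_column_name = scores.sort_index(axis=1)          # sorts the column labels instead

print(by_label)
print(by_label_desc)
```

The values move together with their labels, so each row stays intact while the order changes.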

**If you are curious...try experimenting with this..**

The sort_values() method has a **kind** parameter that allows you to specify which sorting algorithm to use. The available options are:

* `quicksort` (this is the default)
* `mergesort`
* `heapsort`
* `stable`

This can be useful if you are working with very large datasets or need a stable sort, which keeps rows with equal values in their original order (`mergesort` and `stable` guarantee this).
```c
bio_data.sort_values(by='Seat_No', kind='mergesort')
```

### G. Grouping and Aggregation

When it comes to data analysis, grouping and aggregating the data is very useful for insight gathering.

In this section, I will be creating a custom dataset manually to illustrate the examples.

Before that, let's explain some concepts.
For grouping in pandas, we make use of the `groupby()` method. This allows for quick analysis and summarization of our dataset regardless of the size.

How does it work? The method splits the dataset into groups based on the selected column, _applies a function_ to each of the groups, then combines the results.
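The split, apply and combine steps can be sketched by hand and compared with the one-line `groupby()` version. The DataFrame and the `Fee_Paid` column are made up for this illustration.

```python
import pandas as pd

# Hypothetical sample data
sample = pd.DataFrame({
    "Course_Track": ["AI", "Web Dev", "AI", "Web Dev", "AI"],
    "Fee_Paid": [100, 50, 200, 70, 300],
})

# Split: one sub-DataFrame per track. Apply: sum each one. Combine: build a Series.
manual = pd.Series({
    track: group["Fee_Paid"].sum()
    for track, group in sample.groupby("Course_Track")
})

# groupby does the same three steps in a single expression
grouped = sample.groupby("Course_Track")["Fee_Paid"].sum()

print(manual)
print(grouped)
```

Both results hold one total per track; the second form is what you would normally write.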

What are the functions that are applied? They are the aggregation functions.
We have the `.agg({})` method and other aggregation functions, which include:

```c
.sum() #Sum of values
.mean() #Mean of values
.median() #Median value
.count() #Number of non-null values
.min() #Minimum value
.max() #Maximum value
.std() #Standard deviation
.var() #Variance
.nunique() #Number of unique values
.get_group() # Not an aggregation: retrieves a single group by key
```
The `.agg({})` method takes a dictionary that maps each column name to one or more aggregation functions, depending on what you are working on. Inside the dictionary, the aggregation functions are written as strings without the round brackets, for example `{"Seat_No": "max"}`.
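Here is a minimal sketch of the dictionary form of `.agg()`, using a small made-up DataFrame. It shows one column with several functions and another column with a single function.

```python
import pandas as pd

# Hypothetical sample data
sample = pd.DataFrame({
    "Course_Track": ["AI", "Web Dev", "AI", "Web Dev", "AI"],
    "City": ["Lagos", "Abuja", "Lagos", "Kano", "Jos"],
    "Seat_No": [101, 102, 103, 104, 105],
})

summary = sample.groupby("Course_Track").agg({
    "Seat_No": ["min", "max"],  # several functions for one column
    "City": "nunique",          # one function, written as a string without brackets
})

print(summary)
```

Because one column receives a list of functions, the result has two-level column names such as `("Seat_No", "min")`.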

Let's create the dataset for our practice example.

In [None]:
# Lets manually create a bio_data sample data
bio = {
    'First_Name': ['Emeka', 'Aisha', 'Ayo', 'Chinedu', 'Fatima', 'Ibrahim', 'Ngozi', 'Tolu', 'Olamide', 'Yusuf',
                   'Ada', 'Kunle', 'Mercy', 'Segun', 'Zainab', 'Donald', 'Kemi', 'Usman', 'Funmi', 'Chika'],
    'Last_Name': ['Julius', 'Bello', 'Adewale', 'Godswill', 'Abubakar', 'David', 'Collins', 'Ogunleye', 'Adepoju', 'Garba',
                  'Umeh', 'Ojo', 'Musa', 'Balogun', 'Mohammed', 'Obi', 'Adebayo', 'Suleiman', 'Williams', 'Micheal'],
    'Gender': ['Male', 'Female', 'Male', 'Male', 'Female', 'Male', 'Female', 'Male', 'Male', 'Male',
               'Female', 'Male', 'Female', 'Male', 'Female', 'Male', 'Female', 'Male', 'Female', 'Male'],
    'Seat_No': range(101, 121),
    'City': ['Lagos', 'Abuja', 'Ibadan', 'Enugu', 'Kano', 'Benin', 'Port Harcourt', 'Abeokuta', 'Benin', 'Abeokuta',
             'Lagos', 'Abeokuta', 'Lagos', 'Ibadan', 'Abuja', 'Port Harcourt', 'Benin', 'Jos', 'Calabar', 'Onitsha'],
    'Course_Track': ['Data Science', 'Cloud Computing', 'Cybersecurity', 'AI', 'Data Science', 'Cloud Computing',
                     'Web Dev', 'AI', 'Cybersecurity', 'AI', 'Data Science', 'Web Dev',
                     'Cybersecurity', 'AI', 'Cloud Computing', 'Data Science', 'Web Dev', 'Data Science',
                     'Data Science', 'Cloud Computing'],
    'PC_make': ['HP', 'Dell', 'HP', 'Asus', 'Apple', 'HP', 'Dell', 'Lenovo', 'Asus', 'Apple',
                'HP', 'Dell', 'Lenovo', 'Asus', 'Dell', 'HP', 'Dell', 'Lenovo', 'Asus', 'Apple'],
    'PC_Os': ['Windows', 'Linux', 'Windows', 'Windows', 'Linux', 'MacOS', 'Windows', 'Linux', 'MacOS', 'Windows',
              'Linux', 'MacOS', 'Windows', 'Linux', 'MacOS', 'Windows', 'Linux', 'MacOS', 'Windows', 'Linux'],
    'Feedback': ['Good', 'Excellent', 'Excellent', 'Good', 'Poor', 'Excellent', 'Good', 'Average', 'Good', 'Excellent',
                 'Good', 'Poor', 'Average', 'Excellent', 'Good', 'Average', 'Excellent', 'Good', 'Good', 'Excellent']
}

In [None]:
# Lets convert it to dataframe first
bio_data2 = pd.DataFrame(bio)

In [None]:
bio_data2.head()

In [None]:
# Lets save it as a CSV file
bio_data2.to_csv("bio_data2.csv", index = False)

In [None]:
bio_data2["Course_Track"].unique()

In [None]:
# How many students are taking each track?

track_count = bio_data2.groupby("Course_Track").agg({"First_Name":"count"})
track_count

In [None]:
# How many students use each PC make?
bio_data2.groupby("PC_make")["PC_make"].count()

In [None]:

bio_data2.groupby('Course_Track').agg({'First_Name': 'count', 'PC_make': 'count'})

In [None]:
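# Which cities are the students in each track from?
# ', '.join lists the unique cities; 'sum' would glue the names together with no separator
bio_data2.groupby('Course_Track').agg({'City': lambda cities: ', '.join(cities.unique())})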

In [None]:
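# What types of OS do students in each track use?
# Again we join the unique values so that the result is readable
bio_data2.groupby('Course_Track').agg({'PC_Os': lambda os_names: ', '.join(os_names.unique())})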

In [None]:
# What is the most common course among the female students?
female_group = bio_data2.groupby('Gender').get_group('Female')
female_group

In [None]:
female_group["Course_Track"].value_counts()

In [None]:
gender_size= bio_data2.groupby("Gender").size()
gender_size

In [None]:
track_size = bio_data2.groupby("Course_Track").size()
track_size

In [None]:
# The .groups attribute maps each group key to the index labels of the rows in that group

#by_city = bio_data2.groupby("City")
#by_city.groups["Lagos"]

# or

bio_data2.groupby("City").groups["Lagos"]

### H. Data Reshaping

Reshaping data is a key part of data manipulation in pandas. It involves changing the layout or structure of the dataframe without altering the data.

Below are a few of the tools. If you are curious, you can do a little research on reshaping pandas dataframes.

```c
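pivot() # Reshapes the dataframe from long format to wide format, using the values of chosen columns as the new index and columns. It does not aggregate.
```


```c
pivot_table() # Like pivot(), but it aggregates repeated entries, so it summarizes your table just like a pivot table in an Excel spreadsheet.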
```

```c
melt() # converts dataframe from wide format to long format
```

```c
.T # This is used for transposing your dataframe, that is, swapping the rows and the columns
```
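A short sketch of `melt()`, `pivot()` and `.T` on a made-up table of scores follows. The names and numbers are assumptions used only to show how the layout changes.

```python
import pandas as pd

# Hypothetical wide-format data: one column per subject
wide = pd.DataFrame({
    "Name": ["Chris", "Ayo"],
    "Math": [80, 65],
    "English": [72, 90],
})

# Wide to long: one row per (Name, Subject) pair
long = wide.melt(id_vars="Name", var_name="Subject", value_name="Score")

# Long back to wide: each Subject becomes a column again
back_to_wide = long.pivot(index="Name", columns="Subject", values="Score")

# Transpose: subjects become rows and names become columns
transposed = wide.set_index("Name").T

print(long)
print(back_to_wide)
print(transposed)
```

The data is the same in all three tables; only the arrangement of rows and columns differs.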

Let's take one example here. If time permits, we will solve more examples using a dataset where we can apply this concept.


In [None]:
# Note that this is just an illustration of what is possible. It does not make sense to take the mean of Seat_No
pivot_table = pd.pivot_table(bio_data2,
                             index='Gender',
                             columns='Course_Track',
                             values='Seat_No',
                             aggfunc='mean')
print("Pivot Table of Average Seat_No by Gender and Course_Track:")
pivot_table


In [None]:
# LEts transpose this
pivot_table.T

### I. Merging and Joining

Both merging and joining are important techniques that allow you to combine two or more DataFrames based on common columns or indexes.

I will list examples of those functions below and their use cases

```
pd.merge()
```
The merge function gives us the SQL feel of joining. We can do inner, left, right, and outer joins.
With merge, we typically join both dataframes using a common column.
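To see how the `how` argument changes the result, here is a small sketch with two made-up DataFrames where one key has no match. The `Robotics` track and the fee values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical data: Robotics has no entry in the fees table
students = pd.DataFrame({
    "Name": ["Chris", "Ayo", "Chisom"],
    "Course_Track": ["AI", "Web Dev", "Robotics"],
})
fees = pd.DataFrame({
    "Course_Track": ["AI", "Web Dev"],
    "Fee": [550000, 350000],
})

inner = pd.merge(students, fees, on="Course_Track", how="inner")  # matched rows only
left = pd.merge(students, fees, on="Course_Track", how="left")    # every student kept

print(len(inner), len(left))
print(left)
```

The inner join drops the student whose track has no fee, while the left join keeps that student and fills the missing `Fee` with NaN.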

In [None]:
# Let's add more details to our bio_data2 dataset by creating a new one.



course_data = {
    'Course_Track': ['Data Science', 'Web Dev', 'Cybersecurity', 'AI', 'Cloud Computing'],
    'Duration': ['8 months', '4 months', '5 months', '7 months', '6 months'],
    'Fee': [600000, 350000, 450000, 550000, 500000]
}
course_df = pd.DataFrame(course_data)

# Both bio_data2 and course_df have "Course_Track" in common

# Merge the two DataFrames on Course_Track (inner join by default)
merged_df = pd.merge(bio_data2, course_df, on='Course_Track')
print("Merged DataFrame (Inner Join on Course_Track):")
merged_df.head()

In [None]:
# Left join: keep all rows from bio_data2
left_joined = pd.merge(bio_data2, course_df, on='Course_Track', how='left')
print("Left Joined DataFrame:")
left_joined.head()

# Observe the output: it looks the same as the inner join above, because every Course_Track in bio_data2 has a match in course_df

____________________________________________________
**Hands On Practice**

Try out both right and outer joins. Observe the output and note anything that seems usual or unusual.
___________________________________________________

There is also another method called `.join()`:

```c
.join()
```
This method comes in handy when you want to join using the index. It is a convenient way to combine DataFrames that share a common index.

Let's create a new dataset and try joining it with our existing dataset.

In [None]:
# Create a city DataFrame with details for each unique city in your bio dataset
city_data = {
    'City': ['Lagos', 'Abuja', 'Ibadan', 'Enugu', 'Kano', 'Benin', 'Port Harcourt', 'Abeokuta', 'Jos', 'Calabar', 'Onitsha'],
    'Population': [14000000, 3000000, 5000000, 4000000, 3500000, 2000000, 2500000, 800000, 600000, 500000, 900000],
    'Region': ['South West', 'Federal Capital Territory', 'South West', 'South East', 'North West',
               'South South', 'South South', 'South West', 'North Central', 'South South', 'South East']
}

city_df = pd.DataFrame(city_data)

# Lets set index for the dataset before joining
df_indexed = city_df.set_index("City")

In [None]:
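# Let's join this with merged_df
# merged_df keeps City as a column, so on='City' matches that column against the index of df_indexed

joined_df = merged_df.join(df_indexed, on='City', how='left')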
print("Joined DataFrame using .join():")
joined_df.head()

# Ensure to note the output

**End of class Assignment**

1. https://www.openml.org/data/download/1586202/phpr1uf8O
2. https://www.openml.org/data/download/29/dataset_29_credit-a.arff
3. https://www.openml.org/data/download/1595261/phpMawTba
4. https://drive.google.com/file/d/11cmclsGbfidJ-ETq5bEK302wK5TyUa8C/view?usp=drive_link

Download any two of the datasets above and apply what you have learnt so far to them. If you are curious and wish to explore further, feel free to go beyond what we covered.