# Pandas Overview
Let's recap the essentials...

## What is Pandas and what is it used for?

It is a Python library designed for handling datasets. It offers a variety of tools for analysing, cleaning, exploring, and manipulating data.

The name **"Pandas"** is derived from a combination of **"Panel Data"** and **"Python Data Analysis"**.

Pandas is a newer package built on top of NumPy. Its `Series` and `DataFrame` objects provide efficient implementations of the "data munging" tasks that occupy much of a data scientist's time.

Pandas objects can be thought of as enhanced versions of NumPy structured arrays. NumPy is limited where flexibility is required:
- Attaching labels to data
- Working with missing data
- Attempting operations that do not map well to element-wise broadcasting, e.g. grouping
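
A minimal sketch of the three points above, using made-up values (the names `temps`, `with_gap` and `readings` are illustrative and do not come from the notebook):

```python
import numpy as np
import pandas as pd

# 1. Labels: values are addressed by name rather than by position.
temps = pd.Series([21.5, 19.0, 23.1], index=["Mon", "Tue", "Wed"])
tuesday = temps["Tue"]

# 2. Missing data: NaN entries are skipped by default in aggregations.
with_gap = pd.Series([1.0, np.nan, 3.0])
total = with_gap.sum()

# 3. Grouping: split the rows by a key, then aggregate each group.
readings = pd.DataFrame({"sensor": ["a", "b", "a", "b"],
                         "value": [1, 2, 3, 4]})
per_sensor = readings.groupby("sensor")["value"].sum()
print(per_sensor)
```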

## What are the fundamental Pandas data structures for handling data?

Series, DataFrame, and Index.

- **Series**: a one-dimensional labeled (indexed) array holding data of any type such as integers and strings. It can be created from a list or array.
- **DataFrame**: a two-dimensional labeled data structure that holds data like a two-dimensional array, with labeled rows and columns.
- **Index**: an immutable array of labels that both `Series` and `DataFrame` use to reference their data.

# The Pandas Series Object

In [1]:
import pandas as pd
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

In [2]:
# Series as a generalised NumPy array
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [3]:
# Series as specialised dictionary
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)
population

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

# The Pandas DataFrame Object

## What is DataFrame?

DataFrames are essentially two-dimensional arrays with attached row and column labels, and often with heterogeneous types and/or missing data.

Pandas provides an efficient implementation of a `DataFrame`.
- Convenient storage interface for labelled data.
- Provides powerful data operations familiar to users of database and spreadsheet programs.

In [4]:
# Series as specialised dictionary
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)
population

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

In [5]:
# let's construct a new Series
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995}
area = pd.Series(area_dict)
area

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
dtype: int64

In [6]:
# DataFrame as a generalised NumPy array
states = pd.DataFrame({'population': population,
                       'area': area})
states

            population    area
California    38332521  423967
Texas         26448193  695662
New York      19651127  141297
Florida       19552860  170312
Illinois      12882135  149995


In [7]:
states['area']

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

## Constructing a DataFrame object
There are a number of ways to construct a DataFrame:

1. From a collection of `Series` objects.

2. A single-column `DataFrame` can be constructed from a single `Series`.

3. It can also be constructed from a __list of dictionaries__.

4. A `DataFrame` can be constructed from a __dictionary of Series objects__.

5. You can create a `DataFrame` from a __two-dimensional array__, with specified column and index names. (If omitted, an integer index is used for each.)

6. From a __NumPy structured array__.
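
The construction routes listed above can be sketched as follows. The data values and variable names are illustrative only; the two-state `population` and `area` Series are a shortened version of the ones used earlier.

```python
import numpy as np
import pandas as pd

population = pd.Series({"California": 38332521, "Texas": 26448193})
area = pd.Series({"California": 423967, "Texas": 695662})

# 1 and 4. From a dictionary of Series (keys become column names).
from_series_dict = pd.DataFrame({"population": population, "area": area})

# 2. From a single Series (to_frame names the one resulting column).
from_single = population.to_frame(name="population")

# 3. From a list of dictionaries (keys become column names).
from_dicts = pd.DataFrame([{"a": i, "b": 2 * i} for i in range(3)])

# 5. From a two-dimensional array, with explicit column and index names.
from_array = pd.DataFrame(np.arange(6).reshape(3, 2),
                          columns=["foo", "bar"],
                          index=["a", "b", "c"])

# 6. From a NumPy structured array (field names become column names).
structured = np.zeros(3, dtype=[("A", "i8"), ("B", "f8")])
from_structured = pd.DataFrame(structured)
```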

# The Pandas Index Object

Both the `Series` and `DataFrame` objects contain an explicit *index* that lets you reference and modify data.

The `Index` object can be thought of either as an immutable array or as an ordered set.
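
As a small illustration of the immutability (the name `ind` and its values are made up for this sketch), reading from an `Index` works like reading from an array, while item assignment raises a `TypeError`:

```python
import pandas as pd

ind = pd.Index([2, 3, 5, 7, 11])

# Index supports array-style read access...
second = ind[1]
every_other = ind[::2]

# ...but item assignment is rejected, because an Index is immutable.
try:
    ind[1] = 0
    assignment_failed = False
except TypeError as err:
    assignment_failed = True
    print("TypeError:", err)
```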

## Index as an ordered set

In [8]:
indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

In [9]:
indA

Int64Index([1, 3, 5, 7, 9], dtype='int64')

In [10]:
indB

Int64Index([2, 3, 5, 7, 11], dtype='int64')

In [11]:
pd.__version__

'1.5.3'

In [12]:
import numpy as np
np.__version__

'1.23.5'

In [13]:
indA & indB  # intersection; the operator form is deprecated (FutureWarning in pandas 1.5.3), prefer indA.intersection(indB)

Int64Index([3, 5, 7], dtype='int64')

In [15]:
indA.intersection(indB)

Int64Index([3, 5, 7], dtype='int64')

In [16]:
indA | indB  # union; the operator form is deprecated (FutureWarning in pandas 1.5.3), prefer indA.union(indB)

Int64Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')

In [17]:
indA.union(indB)

Int64Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')

In [20]:
indA ^ indB  # symmetric difference; the operator form is deprecated (FutureWarning in pandas 1.5.3), prefer indA.symmetric_difference(indB)

Int64Index([1, 2, 9, 11], dtype='int64')

In [22]:
indA.symmetric_difference(indB)

Int64Index([1, 2, 9, 11], dtype='int64')

# Data Indexing and Selection

__indexer attributes__

- `loc` allows indexing and slicing that always reference the explicit index (the labels).
- `iloc` allows indexing and slicing that always reference the implicit Python-style index (integer positions).
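
The difference matters most when the labels are themselves integers. A minimal sketch, with illustrative data whose labels deliberately differ from the positions:

```python
import pandas as pd

# Integer labels that deliberately differ from positions.
data = pd.Series(["a", "b", "c"], index=[1, 3, 5])

by_label = data.loc[1]            # the entry labeled 1
by_position = data.iloc[1]        # the entry at position 1
label_slice = data.loc[1:3]       # labels 1 to 3, end label included
position_slice = data.iloc[1:3]   # positions 1 and 2, end position excluded
```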

# Missing Data in Pandas

Strategies for handling missing data in general revolve around two approaches:
- Masking the missing values
- Choosing a sentinel value indicative of a missing entry

### Trade-offs

- Masks: require allocation of an additional Boolean array, which adds overhead in both storage and computation.
- Sentinel values: reduces the range of values that can be represented.
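
Both approaches can be seen side by side in Pandas itself. The sketch below assumes a Pandas version that provides the nullable `Int64` extension dtype; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Sentinel approach: NaN marks the gap, which forces a float dtype.
sentinel = pd.Series([1, np.nan, 3])

# Mask approach: the nullable "Int64" dtype keeps the values as integers and
# tracks missing entries in a separate Boolean mask (displayed as <NA>).
masked = pd.Series([1, None, 3], dtype="Int64")

print(sentinel.dtype, masked.dtype)
```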

## `None`: Pythonic missing data

`None` is a Python singleton object.

It can be used to mark missing values only in arrays with data type `object`.

In [23]:
import numpy as np
import pandas as pd

vals1 = np.array([1, None, 3, 4]) # None
vals1

array([1, None, 3, 4], dtype=object)

## `NaN`: Missing numerical data

`NaN` (Not a Number) is a special floating-point value recognized by all systems that use the standard IEEE floating-point representation.

In [24]:
vals2 = np.array([1, np.nan, 3, 4])
print(vals2) # nan
vals2.dtype

[ 1. nan  3.  4.]


dtype('float64')
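
One consequence worth seeing in code: `NaN` propagates through arithmetic, so plain aggregates return `NaN`, while NumPy's `nan`-prefixed aggregates skip the missing entries. A minimal sketch reusing the values of `vals2`:

```python
import numpy as np

vals2 = np.array([1, np.nan, 3, 4])

# Any arithmetic involving NaN yields NaN.
propagated = 1 + np.nan
plain_sum = vals2.sum()

# The nan-prefixed aggregates ignore NaN entries.
safe_sum = np.nansum(vals2)
safe_max = np.nanmax(vals2)
```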

## NaN and None in Pandas

`NaN` and `None` both have their place, and Pandas is built to handle the two of them nearly interchangeably, converting between them where appropriate:

In [25]:
pd.Series([1, np.nan, 2, None])  # None is converted to NaN

0    1.0
1    NaN
2    2.0
3    NaN
dtype: float64

## Detecting null values

Pandas data structures have two useful methods for detecting null data: `isnull()` and `notnull()`. Either one will return a Boolean mask over the data.

In [26]:
pd.Series([1, np.nan, 2, None]).isnull()

0    False
1     True
2    False
3     True
dtype: bool

In [27]:
pd.Series([1, np.nan, 2, None]).notnull()

0     True
1    False
2     True
3    False
dtype: bool

## Dropping null values

In addition to the masking used before, there are the convenience methods, `dropna()` (which removes NA values) and `fillna()` (which fills in NA values)

We cannot drop single values from a DataFrame; we can only drop full rows or full columns.

In [31]:
df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])
df_copy = df.copy()
df

     0    1  2
0  1.0  NaN  2
1  2.0  3.0  5
2  NaN  4.0  6


In [32]:
df.dropna()

     0    1  2
1  2.0  3.0  5
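
By default `dropna()` drops every row that contains a null value. A short sketch of the other common options on the same data (the result names are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])

# Drop columns instead of rows: only column 2 has no NaN.
cols_kept = df.dropna(axis="columns")

# Drop only rows that are entirely NaN: none here, so nothing is dropped.
all_null_only = df.dropna(how="all")

# Keep rows with at least 3 non-null values: only row 1 qualifies.
at_least_three = df.dropna(thresh=3)
```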


## Filling null values

Pandas provides the `fillna()` method, which returns a copy of the array with the null values replaced.

In [34]:
df_copy.fillna(6)

     0    1  2
0  1.0  6.0  2
1  2.0  3.0  5
2  6.0  4.0  6
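
A constant is not the only possible fill. The sketch below, on the same data, shows forward fill, backward fill, and filling with each column's mean; which one is appropriate depends on the data, so treat these as options rather than recommendations.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])

forward = df.ffill()               # carry the previous row's value downwards
backward = df.bfill()              # carry the next row's value upwards
col_means = df.fillna(df.mean())   # replace each NaN with its column mean
```

Note that a forward fill leaves a `NaN` in the first row untouched, because there is no earlier value to carry down.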


# Exercises:

1. Convert Dictionary1 to a Pandas series.
   
   Dictionary1 = {'A': 1, 'B': 2, 'C': 3, 'D': 4, 'E': 5}

In [36]:
import pandas as pd
Dictionary1 = {'A': 1, 'B': 2, 'C': 3, 'D': 4, 'E': 5}
series1 = pd.Series(Dictionary1)
series1

A    1
B    2
C    3
D    4
E    5
dtype: int64

2. Add a sixth entry with a new index label to the Series from Q1, then display the fifth and sixth entries.

In [37]:
series1['F'] = 6
series1.iloc[4:6]  # positional slice: entries at positions 4 and 5

E    5
F    6
dtype: int64

3. Create a Pandas Series from a scalar value, repeated across five string index labels.

In [38]:
pd.Series(5, index=['A', 'B', 'C', 'D', 'E'])

A    5
B    5
C    5
D    5
E    5
dtype: int64

In [39]:
pd.Series(6, index=['a', 'b', 'c'])

a    6
b    6
c    6
dtype: int64

4. Convert the following list of dictionaries into a Pandas DataFrame:

Quiz_Marks = [
    {'st_name': 'A', 'st_mark': 100},
    {'st_name': 'B', 'st_mark': 90},
    {'st_name': 'C', 'st_mark': 80},
    {'st_name': 'D', 'st_mark': 70},
    {'st_name': 'E', 'st_mark': 60},
]

labels = ['st1', 'st2', 'st3', 'st4', 'st5']


In [42]:
Quiz_Marks = [
    {'st_name': 'A', 'st_mark': 100},
    {'st_name': 'B', 'st_mark': 90},
    {'st_name': 'C', 'st_mark': 80},
    {'st_name': 'D', 'st_mark': 70},
    {'st_name': 'E', 'st_mark': 60},
]
labels = ['st1', 'st2', 'st3', 'st4', 'st5']
df = pd.DataFrame(Quiz_Marks , index=labels)
df

    st_name  st_mark
st1       A      100
st2       B       90
st3       C       80
st4       D       70
st5       E       60


5. From the Quiz_Marks DataFrame of Q4, select the `st_mark` column.

In [43]:
df[['st_mark']]

     st_mark
st1      100
st2       90
st3       80
st4       70
st5       60


In [44]:
df['st_mark']

st1    100
st2     90
st3     80
st4     70
st5     60
Name: st_mark, dtype: int64

6. From the Quiz_Marks DataFrame of Q4, select the first and last marks of the `st_mark` column.

In [45]:
df.iloc[[0, 4], [1]]

     st_mark
st1      100
st5       60


7. Create a Pandas DataFrame in which the second column holds the squares of the first column's values (the first ten multiples of 3, starting from 0).

In [46]:
data = [{'3 times table': i*3, 'Square of 3 times table': (i*3)**2}
        for i in range(10)]
pd.DataFrame(data)

   3 times table  Square of 3 times table
0              0                        0
1              3                        9
2              6                       36
3              9                       81
4             12                      144
5             15                      225
6             18                      324
7             21                      441
8             24                      576
9             27                      729


In [47]:
[{'Even': i*2, 'Squared_even': (i*2)**2} for i in range(6)]

[{'Even': 0, 'Squared_even': 0},
 {'Even': 2, 'Squared_even': 4},
 {'Even': 4, 'Squared_even': 16},
 {'Even': 6, 'Squared_even': 36},
 {'Even': 8, 'Squared_even': 64},
 {'Even': 10, 'Squared_even': 100}]

8. Based on the below series:

`data = pd.Series([100, 200, 300, 400], index=['A', 'B', 'C', 'D'])`

Does slicing with an explicit index (i.e., `data['A':'C']`) display the same number of entries as slicing with an implicit index (i.e., `data[0:2]`)?

In [48]:
import pandas as pd
data = pd.Series([100, 200, 300, 400],
                 index=['A', 'B', 'C', 'D'])
data

A    100
B    200
C    300
D    400
dtype: int64

In [49]:
# Up to and inclusive
data['A':'C']

A    100
B    200
C    300
dtype: int64

In [51]:
# Up to and inclusive
data.loc['A':'C']

A    100
B    200
C    300
dtype: int64

In [50]:
# Up to but exclusive
data.iloc[0:2]

A    100
B    200
dtype: int64

With an explicit (label-based) index, the final label is included in the slice; with an implicit (position-based) index, the final position is excluded. The two slices therefore do not return the same number of entries.

9. Given the two indices below, print the items that are not common to both.

    `ind1 = pd.Index([1, 2, 3, 4, 5])`

    `ind2 = pd.Index([1, 3, 4, 9, 10])`

In [52]:
ind1 = pd.Index([1, 2, 3, 4, 5])
ind2 = pd.Index([1, 3, 4, 9, 10])

#ind1 ^ ind2

ind1.symmetric_difference(ind2)

Int64Index([2, 5, 9, 10], dtype='int64')

10. Given the two Series below, print the items that are not common to both.

   `series1 = pd.Series([1, 2, 3, 4, 5])`
   
   `series2 = pd.Series([1, 3, 4, 9, 10])`

In [53]:
import pandas as pd
import numpy as np
series1 = pd.Series([1, 2, 3, 4, 5])
series2 = pd.Series([1, 3, 4, 9, 10])
sr1 = pd.Series(np.union1d(series1, series2))
print(sr1)
sr2 = pd.Series(np.intersect1d(series1, series2))
print(sr2)
# The ~ operator negates a Boolean mask, so this keeps the elements of sr1
# that are not present in sr2.
items = sr1[~sr1.isin(sr2)]
items

0     1
1     2
2     3
3     4
4     5
5     9
6    10
dtype: int64
0    1
1    3
2    4
dtype: int64


1     2
4     5
5     9
6    10
dtype: int64
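
As an alternative sketch (not the notebook's approach), NumPy's `setxor1d` computes the same set of non-common values in a single call. The name `not_common` is illustrative, and the result gets a fresh 0-based index rather than the positions shown above.

```python
import numpy as np
import pandas as pd

series1 = pd.Series([1, 2, 3, 4, 5])
series2 = pd.Series([1, 3, 4, 9, 10])

# setxor1d returns the sorted values found in exactly one of the two inputs.
not_common = pd.Series(np.setxor1d(series1, series2))
print(not_common.tolist())
```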

11. Count the NaN values across all columns of the DataFrame below, then replace all the NaN values with zeros.

        `df = pd.DataFrame([[1,      np.nan, 2],
                           [2,      3,      5],
                           [np.nan, 4,      6]])`

In [56]:
df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])

print("The number of NaN values:\n", df.isnull().values.sum())

df =  df.fillna(0)
print("\ndf DataFrame after replacing all NaN with 0:")
print(df)

The number of NaN values:
 2

df DataFrame after replacing all NaN with 0:
     0    1  2
0  1.0  0.0  2
1  2.0  3.0  5
2  0.0  4.0  6
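
The solution above reports a single grand total. If a count per column is wanted instead, a minimal sketch on the same data (before filling) is:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])

# isnull() gives a Boolean frame; summing it counts the True values per column.
per_column = df.isnull().sum()
total = int(per_column.sum())
print(per_column)
```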


12. Join the two DataFrames below (`std_info1`, `std_info2`) along rows.

In [57]:
import pandas as pd

std_info1 = pd.DataFrame({
    'std_id': ['std1', 'std2', 'std3', 'std4', 'std5'],
    'std_name': ['Adam', 'Noah', 'Peter', 'Matthew', 'Ahmad'],
    'std_mark': [50, 60, 70, 80, 90]})

std_info2 = pd.DataFrame({
    'std_id': ['std6', 'std7', 'std8', 'std9', 'std10'],
    'std_name': ['Georgia', 'Bethan', 'Daniella', 'Rima', 'Salma'],
    'std_mark': [55, 65, 75, 85, 95]})

In [58]:
concat_df = pd.concat([std_info1, std_info2])
concat_df

  std_id  std_name  std_mark
0   std1      Adam        50
1   std2      Noah        60
2   std3     Peter        70
3   std4   Matthew        80
4   std5     Ahmad        90
0   std6   Georgia        55
1   std7    Bethan        65
2   std8  Daniella        75
3   std9      Rima        85
4  std10     Salma        95
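
Note that the row labels 0 to 4 appear twice in the result, because `pd.concat` preserves each frame's own index. A minimal sketch, using shortened two-row versions of the frames, of how `ignore_index=True` renumbers the rows instead:

```python
import pandas as pd

# Shortened stand-ins for std_info1 and std_info2.
std_info1 = pd.DataFrame({"std_id": ["std1", "std2"], "std_mark": [50, 60]})
std_info2 = pd.DataFrame({"std_id": ["std6", "std7"], "std_mark": [55, 65]})

stacked = pd.concat([std_info1, std_info2])                        # index 0, 1, 0, 1
renumbered = pd.concat([std_info1, std_info2], ignore_index=True)  # index 0, 1, 2, 3
```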


13. Join the two given dataframes (`std_info1`, `std_info2`) along columns

In [59]:
concat_df = pd.concat([std_info1, std_info2], axis = 1)
concat_df

  std_id std_name  std_mark std_id  std_name  std_mark
0   std1     Adam        50   std6   Georgia        55
1   std2     Noah        60   std7    Bethan        65
2   std3    Peter        70   std8  Daniella        75
3   std4  Matthew        80   std9      Rima        85
4   std5    Ahmad        90  std10     Salma        95


14. Append one or more extra rows to the `std_info2` DataFrame.

In [None]:
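# DataFrame.append was removed in pandas 2.0 and _append is a private method,
# so the new row is built as a one-row DataFrame and added with pd.concat
std11 = pd.DataFrame([{'std_id': 'std11', 'std_name': 'Mira', 'std_mark': 100}])
df_appended_row = pd.concat([std_info2, std11], ignore_index=True)
df_appended_row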

  std_id  std_name  std_mark
0   std6   Georgia        55
1   std7    Bethan        65
2   std8  Daniella        75
3   std9      Rima        85
4  std10     Salma        95
5  std11      Mira       100


15. Append a list of dictionaries to `std_info1` DataFrame.

In [61]:
dicts = [{'std_id': 'std6', 'std_name': 'Amer', 'std_mark': 100},
         {'std_id': 'std7', 'std_name': 'Andy', 'std_mark': 100}]
# pd.concat is the public replacement for the private _append method
df_appended_dicts = pd.concat([std_info1, pd.DataFrame(dicts)], ignore_index=True)
df_appended_dicts

  std_id std_name  std_mark
0   std1     Adam        50
1   std2     Noah        60
2   std3    Peter        70
3   std4  Matthew        80
4   std5    Ahmad        90
5   std6     Amer       100
6   std7     Andy       100


16. Join the given DataFrames (`std_info1`, `std_info2`) along rows, then merge the result with a new DataFrame (`module_data`) on the `std_id` column.

In [62]:
module_data = pd.DataFrame({
        'std_id': ['std1', 'std2', 'std3', 'std4', 'std5', 'std6', 'std7', 'std8', 'std9', 'std10'],
        'module_id': [1, 1, 1, 2, 2, 1, 1, 1, 2, 2]})

# Stack the two frames; `keys` labels each source frame in an outer index level
concat_data = pd.concat([std_info1, std_info2], keys=['Group 1', 'Group 2'])
print("Concatenated DataFrames before merging it with module_id\n\n", concat_data)

# Merge on the shared `std_id` column; the result gets a fresh integer index
merged_data = pd.merge(concat_data, module_data, on='std_id')
print("\nMerged DataFrames\n\n", merged_data)

Concatenated DataFrames before merging it with module_id

           std_id  std_name  std_mark
Group 1 0   std1      Adam        50
        1   std2      Noah        60
        2   std3     Peter        70
        3   std4   Matthew        80
        4   std5     Ahmad        90
Group 2 0   std6   Georgia        55
        1   std7    Bethan        65
        2   std8  Daniella        75
        3   std9      Rima        85
        4  std10     Salma        95

Merged DataFrames

   std_id  std_name  std_mark  module_id
0   std1      Adam        50          1
1   std2      Noah        60          1
2   std3     Peter        70          1
3   std4   Matthew        80          2
4   std5     Ahmad        90          2
5   std6   Georgia        55          1
6   std7    Bethan        65          1
7   std8  Daniella        75          1
8   std9      Rima        85          2
9  std10     Salma        95          2


17. Join the two DataFrames below so that records available on both sides are matched (an outer join is shown first for comparison, then the inner join).

In [63]:
import pandas as pd

std_info1 = pd.DataFrame({
    'std_id': ['std1', 'std2', 'std3', 'std4', 'std5'],
    'std_name': ['Adam', 'Noah', 'Peter', 'Matthew', 'Ahmad'],
    'std_mark': [50, 60, 70, 80, 90]})

std_info2 = pd.DataFrame({
    'std_id': ['std4', 'std5', 'std6', 'std7', 'std8'],
    'std_name': ['Georgia', 'Bethan', 'Daniella', 'Rima', 'Salma'],
    'std_mark': [55, 65, 75, 85, 95]})

In [64]:
std_info1

  std_id std_name  std_mark
0   std1     Adam        50
1   std2     Noah        60
2   std3    Peter        70
3   std4  Matthew        80
4   std5    Ahmad        90


In [65]:
std_info2

  std_id  std_name  std_mark
0   std4   Georgia        55
1   std5    Bethan        65
2   std6  Daniella        75
3   std7      Rima        85
4   std8     Salma        95


In [66]:
# Outer join: keeps every std_id from either frame and fills the gaps with NaN
merged_data = pd.merge(std_info1, std_info2, on='std_id', how='outer')
merged_data

  std_id std_name_x  std_mark_x std_name_y  std_mark_y
0   std1       Adam        50.0        NaN         NaN
1   std2       Noah        60.0        NaN         NaN
2   std3      Peter        70.0        NaN         NaN
3   std4    Matthew        80.0    Georgia        55.0
4   std5      Ahmad        90.0     Bethan        65.0
5   std6        NaN         NaN   Daniella        75.0
6   std7        NaN         NaN       Rima        85.0
7   std8        NaN         NaN      Salma        95.0


In [67]:
# Inner join: keeps only the std_id values present in both frames
merged_data = pd.merge(std_info1, std_info2, on='std_id', how='inner')
merged_data

  std_id std_name_x  std_mark_x std_name_y  std_mark_y
0   std4    Matthew          80    Georgia          55
1   std5      Ahmad          90     Bethan          65
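
To see which side each row of a join came from, `pd.merge` accepts `indicator=True`, which adds a `_merge` column. A minimal sketch on shortened stand-in frames (`left` and `right` are illustrative names, and the `_1`/`_2` suffixes are an arbitrary choice replacing the default `_x`/`_y`):

```python
import pandas as pd

left = pd.DataFrame({"std_id": ["std3", "std4", "std5"],
                     "std_mark": [70, 80, 90]})
right = pd.DataFrame({"std_id": ["std4", "std5", "std6"],
                      "std_mark": [55, 65, 75]})

# Outer join with a `_merge` column: left_only, right_only, or both.
merged = pd.merge(left, right, on="std_id", how="outer",
                  suffixes=("_1", "_2"), indicator=True)

# Rows present on both sides are exactly what an inner join would return.
matches = merged[merged["_merge"] == "both"]
print(merged)
```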


## Based on the uploaded three CSV files, answer the following questions:

Read the three CSV files.

In [84]:
from google.colab import drive
drive.mount('/content/drive')
file_path = "/content/drive/MyDrive/Colab Notebooks/ai-programming/week 6/"

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [85]:
import os
# Change directory
os.chdir(file_path)

# Print
print(os.getcwd())
os.listdir()

/content/drive/MyDrive/Colab Notebooks/ai-programming/week 6


['Pandas Overview.ipynb',
 'sales_info.csv',
 'products_info.csv',
 'customers_info.csv',
 'NumPy Overview.ipynb',
 'Numpy Overview Questions-Solutions.ipynb',
 'Numpy Overview Questions.ipynb',
 'Pandas Overview Question-Solutions.ipynb',
 'Pandas Overview Question.ipynb']

In [86]:
import pandas as pd
sales=pd.read_csv("sales_info.csv") # reading from csv file
print(sales, "\n")

products=pd.read_csv("products_info.csv") # reading from csv file
print(products, "\n")

customer=pd.read_csv("customers_info.csv") # reading from csv file
print(customer)

   sale_id  customer_id  product_id product_name  Quantity store_code
0        1            2           3        Phone         2         A1
1        2            2           4     Computer         1         B2
2        3            1           3        Phone         3         A1
3        4            4           2       Laptop         2         B2
4        5            2           3        Phone         3         A1
5        6            3           3        Phone         2         B2
6        7            2           2       Laptop         3         A1
7        8            3           2       Laptop         2         B2
8        9            2           3        Phone         2         A1 

   product_id product_name  product_price
0           1       Tablet            800
1           2       Laptop           2000
2           3        Phone           1000
3           4     Computer           3000
4           5       Fridge            250
5           6          PS5            200
6   

18. Show the `Quantity` of Products sold for each product.

In [92]:
print(sales.groupby(['product_name','product_id'])[['Quantity']].sum())

                         Quantity
product_name product_id          
Computer     4                  1
Laptop       2                  7
Phone        3                 12


19. Show the total `Sales` and `Quantity` for each product

In [88]:
sales_group=sales.groupby(['product_name','product_id', 'store_code'])[['Quantity']].sum()
sales_sum=pd.merge(sales_group,products,how='left',on='product_id')
sales_sum['Total_Sales']=sales_sum['Quantity']*sales_sum['product_price']
print (sales_sum)

   product_id  Quantity product_name  product_price  Total_Sales
0           4         1     Computer           3000         3000
1           2         3       Laptop           2000         6000
2           2         4       Laptop           2000         8000
3           3        10        Phone           1000        10000
4           3         2        Phone           1000         2000
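
The cell above also groups by `store_code`, so it reports one row per product and store. A sketch of totals per product only is given below. It assumes the sales and price values printed earlier (retyped inline so the block is self-contained) and uses the illustrative name `per_product`.

```python
import pandas as pd

# Rows retyped from the sales and products tables printed earlier.
sales = pd.DataFrame({
    "product_id": [3, 4, 3, 2, 3, 3, 2, 2, 3],
    "Quantity":   [2, 1, 3, 2, 3, 2, 3, 2, 2],
})
products = pd.DataFrame({
    "product_id": [2, 3, 4],
    "product_name": ["Laptop", "Phone", "Computer"],
    "product_price": [2000, 1000, 3000],
})

# Sum the quantity per product, attach the price, then compute the revenue.
per_product = sales.groupby("product_id", as_index=False)["Quantity"].sum()
per_product = per_product.merge(products, on="product_id", how="left")
per_product["Total_Sales"] = per_product["Quantity"] * per_product["product_price"]
print(per_product)
```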


20. Show the `Quantity` of each sold product for each `Store`

In [93]:
store = pd.merge(sales, customer, how='left', on='customer_id')
store.groupby(['product_name', 'product_id', 'store_code'])[['Quantity']].sum()

                                    Quantity
product_name product_id store_code          
Computer     4          B2                 1
Laptop       2          A1                 3
                        B2                 4
Phone        3          A1                10
                        B2                 2


In [None]:
print(sales_group.groupby(['product_name','product_id','store_code'])[['Quantity']].sum())

                                    Quantity
product_name product_id store_code          
Computer     4          B2                 1
Laptop       2          A1                 3
                        B2                 4
Phone        3          A1                10
                        B2                 2
