# Introduction to Pandas Series and DataFrames

## Objectives

* Understand Pandas Series and DataFrames
* Create Series and DataFrames
* Perform basic operations with Series
* Explore DataFrame basics
* Select data from DataFrames
* Apply functions to Series and DataFrames

## Loading Libraries

In [4]:
# numpy - for arithmetic operations and high-level mathematical functions that operate on arrays
import numpy as np
# pandas - for working with relational or labeled data
import pandas as pd

## What is a Pandas Series?

* **One-dimensional** labeled array capable of holding data of any type, such as *integers*, *strings*, *floats*, or *Python objects*.
* A `pandas Series` is like a `column` in a table.


### Key features of a Pandas Series

* **Homogeneous Data**: A Series has a single data type (integer, float, string etc.), which keeps its values homogeneous. Values of mixed types are stored under the generic `object` dtype.
* **Labeled Index**: Each element in a Series is associated with a label called an *index*. Having unique labels is common practice, though not strictly required. The labels only need to be hashable, i.e. usable as keys in a dictionary. This index allows for easy and efficient data retrieval and manipulation.
* **Vectorized Operations**: Series support vectorized operations, i.e. you can apply an operation to the entire Series without writing an explicit loop.
* **Alignment of Data**: When performing operations between Series, Pandas automatically aligns the data on index labels, which simplifies data manipulation.
* **Creation**: A Series can be created from a list, a NumPy array, a dictionary, a DataFrame slice, and other data sources.
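
To make the vectorized-operations and alignment points concrete, here is a minimal sketch. The names `jan_sales` and `feb_sales` and their values are made up for illustration and do not appear elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical data: two Series whose index labels only partly overlap
jan_sales = pd.Series([10, 20, 30], index=["A", "B", "C"])
feb_sales = pd.Series([5, 15, 25], index=["B", "C", "D"])

# Vectorized operation: every element is doubled without an explicit loop
doubled = jan_sales * 2

# Alignment: values are matched by label, not by position.
# Labels present in only one Series ("A" and "D") produce NaN.
total = jan_sales + feb_sales

print(doubled)
print(total)
```

Because "A" and "D" each exist in only one of the two Series, their totals are missing values rather than errors.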

In [3]:
# example of a series from a list 
marks = [10, 40, 50, 23, 19, 45]

# series
marks_series = pd.Series(marks)
marks_series

0    10
1    40
2    50
3    23
4    19
5    45
dtype: int64

## Creating and Displaying

In [4]:
# example 1 - creating a series from a list
data = [22.5, 44.5, 45.5, 9.9, 8.1]

# series
list_series = pd.Series(data, name="Student Marks")

In [5]:
list_series

0    22.5
1    44.5
2    45.5
3     9.9
4     8.1
Name: Student Marks, dtype: float64

In [9]:
# example 2 - creating a series from a NumPy Array
data_arr = np.array(data) # create an array from a list

type(data_arr)

numpy.ndarray

In [10]:
# series from array
arr_series = pd.Series(data_arr, name="Arrays Series")
arr_series

0    22.5
1    44.5
2    45.5
3     9.9
4     8.1
Name: Arrays Series, dtype: float64

In [11]:
# example 3 - creating a series from a dictionary
data_dict = {
    "Tusker" : 220,
    "Balozi" : 250,
    "Malt" : 300,
    "Smirnoff": 330,
    "Tots" : 150
}

type(data_dict)

dict

In [14]:
dict_series = pd.Series(data_dict, name="Beer Price")
dict_series

Tusker      220
Balozi      250
Malt        300
Smirnoff    330
Tots        150
Name: Beer Price, dtype: int64

In [45]:
dict_series["Malt"]

300

In [18]:
# series with custom index labels
balance = [1000, 1500, 2000, 4000, 8000] # data to store in the series
custom_label = ['A', 'B', 'C', 'D', 'E'] # custom index labels

custom_label_series = pd.Series(data=balance, index=custom_label, name="Balances")
custom_label_series

A    1000
B    1500
C    2000
D    4000
E    8000
Name: Balances, dtype: int64

## Basic Operations With Series

In [20]:
dict_series

Tusker      220
Balozi      250
Malt        300
Smirnoff    330
Tots        150
Name: Beer Price, dtype: int64

In [21]:
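# select by position with .iloc; plain dict_series[3] on a label index is deprecated in recent pandas releases
print(dict_series.iloc[3])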

330


In [22]:
custom_label_series

A    1000
B    1500
C    2000
D    4000
E    8000
Name: Balances, dtype: int64

In [23]:
print(custom_label_series["C"])

2000


In [48]:
print(custom_label_series["A":"D"])

A    1000
B    1500
C    2000
D    4000
Name: Balances, dtype: int64
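
Note that the label slice above includes its end label `"D"`. A minimal sketch contrasting label-based and position-based slicing, using the explicit `.loc` and `.iloc` accessors; the Series is rebuilt here under the assumed name `balances` so the block runs on its own.

```python
import pandas as pd

# Same values and labels as custom_label_series, rebuilt for a self-contained example
balances = pd.Series([1000, 1500, 2000, 4000, 8000],
                     index=["A", "B", "C", "D", "E"], name="Balances")

# Label-based slice with .loc: the end label "D" is included
by_label = balances.loc["A":"D"]

# Position-based slice with .iloc: the end position 3 is excluded
by_position = balances.iloc[0:3]

print(by_label)
print(by_position)
```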


In [26]:
# Arithmetic calculation
# divide every balance by 100 (the division is applied element-wise)
percent = custom_label_series / 100
percent

A    10.0
B    15.0
C    20.0
D    40.0
E    80.0
Name: Balances, dtype: float64

In [27]:
# filtering elements
# keep only the values greater than or equal to 20
x_filter = percent[percent >= 20]

In [28]:
x_filter

C    20.0
D    40.0
E    80.0
Name: Balances, dtype: float64
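
The filter above works because the comparison produces a boolean Series that is then used to select rows. A small sketch of that mechanism, including a combined condition; `percent` is rebuilt here, and the names `mask` and `between` are introduced only for illustration.

```python
import pandas as pd

# Same values as the percent Series above, rebuilt for a self-contained example
percent = pd.Series([10.0, 15.0, 20.0, 40.0, 80.0],
                    index=["A", "B", "C", "D", "E"], name="Balances")

# The comparison itself returns a boolean Series (the mask)
mask = percent >= 20

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
between = percent[(percent >= 15) & (percent < 80)]

print(mask)
print(between)
```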

In [31]:
percent

A    10.0
B    15.0
C    20.0
D    40.0
E    80.0
Name: Balances, dtype: float64

In [32]:
# mean
mean = percent.mean()
mean

33.0

In [34]:
# std
std = percent.std()
std

28.635642126552707

In [36]:
# max (stored as max_value so the built-in max() is not shadowed)
max_value = percent.max()
max_value

80.0

In [39]:
# median
median = percent.median()
median

20.0

In [41]:
# mode
mode = percent.mode()
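
The `mode` cell stores its result without displaying it. Unlike `mean()` or `median()`, `mode()` returns a Series, because several values can tie for most frequent. A short sketch, with `percent` rebuilt so the block runs on its own; the `repeated` Series is hypothetical.

```python
import pandas as pd

# Same values as the percent Series above
percent = pd.Series([10.0, 15.0, 20.0, 40.0, 80.0],
                    index=list("ABCDE"), name="Balances")

# Every value occurs exactly once, so all five values tie for the mode
all_modes = percent.mode()
print(all_modes)

# Hypothetical Series with one repeated value, to show a single mode
repeated = pd.Series([10, 20, 20, 30])
single_mode = repeated.mode()
print(single_mode)
```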

In [43]:
# summation
summation = percent.sum()
summation

165.0

## Applying Functions to a Series 

### Lambda Functions

* A lambda is a small anonymous function, i.e. one that is not bound to a name when it is defined.
* It behaves like a user-defined function written with `def`, but without a name.
* It's simple and straightforward, requiring only the keyword `lambda`, the argument(s), and an expression.
* The body is limited to a single expression, so a lambda fits on one line.

```
def func_name(parameters):
    code block
    
    return return_value
```

`func = lambda parameters: return_value`

* `lambda`: Keyword that indicates the definition of a lambda function.
* `parameters`: The input parameters that the lambda function takes.
* `return_value`: A single expression that defines the computation the lambda function performs; its result is the return value.
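
Lambdas are most useful when passed directly to another function. A minimal sketch, assuming a made-up list of `(drink, price)` pairs; the names `prices`, `by_price`, and `add` are illustrative only.

```python
# Hypothetical data: (drink, price) pairs
prices = [("Tusker", 220), ("Smirnoff", 330), ("Tots", 150)]

# The lambda receives one pair and returns its price,
# which sorted() then uses as the sort key
by_price = sorted(prices, key=lambda pair: pair[1])

# A lambda can also take more than one parameter
add = lambda a, b: a + b

print(by_price)
print(add(2, 3))
```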

In [51]:
# lets compare the two
def square(x):
    #function to square numbers
    out = x ** 2
    
    return out

square(10)

100

In [53]:
# lambda function
square_lambda = lambda x: x ** 2
square_lambda(10)

100

In [56]:
def print_hello():
    y = "Hello world"
    
    return y

print_hello()

'Hello world'

In [57]:
# lambda function that returns "Hello World" (it takes no parameters)
message = lambda : "Hello World"

message()

'Hello World'

In [60]:
def even_odd(number: int):
    """ check if a number is even or odd """
    if number % 2 == 0:
        return 'Even'
    else:
        return 'Odd'
    
even_odd(78)

'Even'

In [64]:
even_lambda = lambda number: "Even" if (number % 2 == 0) else "Odd"
even_lambda(55)

'Odd'

### Generate Random Numbers

* Use the `NumPy` library to generate random numbers.

In [74]:
# generate 120 random integers from 3 up to 98 (the upper bound 99 is excluded)
random_numbers = np.random.randint(3, 99, size=120)

In [81]:
# display random numbers 
random_numbers

array([52, 98, 87, 93, 92, 22, 84, 70, 35, 67, 28, 54, 54, 93, 63, 90,  4,
       98, 72, 34, 55, 45, 31, 40, 12, 18, 17, 37, 90, 25, 13, 45, 96, 61,
       55, 43, 13, 91, 33, 32, 42, 45, 72, 61, 30, 30,  9, 33, 98, 62, 70,
       82, 86, 14, 35, 41, 63, 16, 89, 63, 62, 87, 26, 66, 59, 38, 50, 56,
       58, 62, 24, 75, 98, 66, 46, 28, 86, 19, 41, 70, 46, 48, 73, 60, 68,
       72, 67, 67, 87, 73, 79, 50, 64, 89, 48, 68, 33, 82, 51, 69, 24, 61,
       41, 95, 74, 38, 53, 11, 90, 89, 73, 52, 77,  4, 40, 89, 10, 66, 82,
       35])

In [76]:
type(random_numbers)

numpy.ndarray
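
The numbers above change on every run. If repeatable output is wanted, one option is a seeded generator. This is a sketch of that alternative, not what the cells above use; the seed value 42 and the variable names are arbitrary choices.

```python
import numpy as np

# The seed value 42 is arbitrary; any fixed integer gives repeatable results
rng = np.random.default_rng(seed=42)

# integers() excludes the upper bound by default, like np.random.randint
repeatable_numbers = rng.integers(3, 99, size=120)

# A second generator created with the same seed yields the same sequence
same_numbers = np.random.default_rng(seed=42).integers(3, 99, size=120)

print((repeatable_numbers == same_numbers).all())
```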

In [79]:
# create a series 
numbers = pd.Series(random_numbers, name= "Numbers")
numbers

0      52
1      98
2      87
3      93
4      92
       ..
115    89
116    10
117    66
118    82
119    35
Name: Numbers, Length: 120, dtype: int32

In [80]:
# display the first five rows of the series
numbers.head()

0    52
1    98
2    87
3    93
4    92
Name: Numbers, dtype: int32

In [84]:
numbers.head(3)

0    52
1    98
2    87
Name: Numbers, dtype: int32

In [83]:
# display last five rows
numbers.tail()

115    89
116    10
117    66
118    82
119    35
Name: Numbers, dtype: int32

In [85]:
numbers.tail(3)

117    66
118    82
119    35
Name: Numbers, dtype: int32

### Using the `apply()` Function in a Series

* `apply()` calls a function on every value in the Series and returns the results as a new Series.
* Above we generated a Series of random numbers and created a function called `square` that takes an int, squares it, and returns the value. Let's apply that function to the Series.

In [86]:
# quick check: square a single number
square(4)

16

In [87]:
# square the series random number
squared_numbers = numbers.apply(square)
squared_numbers

0      2704
1      9604
2      7569
3      8649
4      8464
       ... 
115    7921
116     100
117    4356
118    6724
119    1225
Name: Numbers, Length: 120, dtype: int64
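
For simple arithmetic like squaring, the same result can also be obtained with a vectorized expression, which avoids calling a Python function once per element. A minimal sketch on a five-value sample (the first five numbers shown above); `sample`, `applied`, and `vectorized` are names introduced here.

```python
import pandas as pd

# First five values of the numbers Series shown above
sample = pd.Series([52, 98, 87, 93, 92], name="Numbers")

# Element-by-element: the function is called once per value
applied = sample.apply(lambda x: x ** 2)

# Vectorized: the operator is applied to the whole Series at once
vectorized = sample ** 2

print(applied.tolist())
print(vectorized.tolist())
```

`apply()` remains the more general tool when the logic cannot be written as a single array expression.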

In [91]:
# use .rename to rename the series
squared_numbers.rename("Squared Numbers", inplace=True)  # inplace=True renames squared_numbers itself instead of returning a renamed copy

0      2704
1      9604
2      7569
3      8649
4      8464
       ... 
115    7921
116     100
117    4356
118    6724
119    1225
Name: Squared Numbers, Length: 120, dtype: int64

In [89]:
squared_numbers.head()

0    2704
1    9604
2    7569
3    8649
4    8464
Name: Squared Numbers, dtype: int64

### `lambda` function with `apply()`

In [92]:
# Cube the numbers using lambda and apply
cubed_numbers = numbers.apply(lambda j: j ** 3)

In [93]:
cubed_numbers.head(3)

0    140608
1    941192
2    658503
Name: Numbers, dtype: int64

In [94]:
# rename the series
cubed_numbers.rename("Cubed Numbers", inplace=True)

0      140608
1      941192
2      658503
3      804357
4      778688
        ...  
115    704969
116      1000
117    287496
118    551368
119     42875
Name: Cubed Numbers, Length: 120, dtype: int64

In [95]:
cubed_numbers.head()

0    140608
1    941192
2    658503
3    804357
4    778688
Name: Cubed Numbers, dtype: int64

### Using the `map()` Function in a series

* `map()` substitutes each value in a Series with another value, which makes it a convenient way to transform a Series element by element.

In [96]:
def bmi_value(x):
    if x <= 43:
        return "Underweight"
    else:
        return "Overweight"
    
bmi_value(30)

'Underweight'

In [97]:
# map our random numbers as underweight or overweight
bmi_series = numbers.map(bmi_value)
bmi_series.rename("BMI Series", inplace=True)
bmi_series.head(10)

0     Overweight
1     Overweight
2     Overweight
3     Overweight
4     Overweight
5    Underweight
6     Overweight
7     Overweight
8    Underweight
9     Overweight
Name: BMI Series, dtype: object
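
Besides a function, `map()` also accepts a dictionary that lists the substitutions directly. A small sketch with made-up data; `grades` and `grade_labels` are hypothetical names that do not appear elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical Series of letter grades
grades = pd.Series(["A", "B", "C", "A", "E"], name="Grades")

# Dictionary of substitutions: key = old value, value = new value
grade_labels = {"A": "Excellent", "B": "Good", "C": "Fair"}

# Values without a matching key ("E" here) become NaN
described = grades.map(grade_labels)

print(described)
```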

### `lambda` function with `map()`

In [99]:
# use lambda function with map() to double each number
double_number = numbers.map(lambda t: t * 2)
double_number.rename("Doubled Numbers", inplace=True)
double_number.head(5)

0    104
1    196
2    174
3    186
4    184
Name: Doubled Numbers, dtype: int64

### `lambda` function with Conditional Statement

In [102]:
# are the random numbers even or odd
even_odd_series = numbers.apply(lambda k: 'Even' if (k % 2 == 0) else 'Odd')
even_odd_series.rename("Even Odd Series", inplace=True)
even_odd_series.tail(5)

115     Odd
116    Even
117    Even
118    Even
119     Odd
Name: Even Odd Series, dtype: object

## Series to DataFrame 

* A **Series** is like a *table* with a single column, while a **DataFrame** is a *table* with one or more columns that share the same index.

In [103]:
# let's combine all the series we created into a dataframe
print(numbers.name)
print(double_number.name)

Numbers
Doubled Numbers


In [108]:
test_df = pd.DataFrame({
    numbers.name : numbers,
    double_number.name : double_number,
    squared_numbers.name : squared_numbers,
    cubed_numbers.name : cubed_numbers,
    bmi_series.name : bmi_series,
    even_odd_series.name : even_odd_series
})

test_df.tail()

|     | Numbers | Doubled Numbers | Squared Numbers | Cubed Numbers | BMI Series | Even Odd Series |
| --- | --- | --- | --- | --- | --- | --- |
| 115 | 89 | 178 | 7921 | 704969 | Overweight | Odd |
| 116 | 10 | 20 | 100 | 1000 | Underweight | Even |
| 117 | 66 | 132 | 4356 | 287496 | Overweight | Even |
| 118 | 82 | 164 | 6724 | 551368 | Overweight | Even |
| 119 | 35 | 70 | 1225 | 42875 | Underweight | Odd |
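
Another way to combine named Series into a DataFrame is `pd.concat` with `axis=1`, which uses each Series name as the column name. A minimal sketch on a three-value sample; this is an alternative to the dictionary approach above, and the variable names are illustrative.

```python
import pandas as pd

# Three-value stand-ins for the Series built earlier in this notebook
numbers = pd.Series([52, 98, 87], name="Numbers")
doubled = (numbers * 2).rename("Doubled Numbers")

# axis=1 places the Series side by side as columns, aligned on the index
combined = pd.concat([numbers, doubled], axis=1)

print(combined)
```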


## Knock Yourself Out!

You work as a real estate agent at *MoringaHome Realty*. To help your clients make informed decisions about property investment, you decide to analyze property data using Pandas.

1. Generate 120 random numbers between Ksh 4,000 and Ksh 20,000 using NumPy to represent the prices of the houses.
2. Display the first and last 7 houses.
3. Create a function that takes in the price of a house and returns the category of that house, e.g. Suburb. The categories are of your own choosing.
4. Apply the function created above to the Series.
5. Apply a lambda function to increase the property prices by 10% due to the new tax laws.
6. Apply a custom function to increase the property prices by an additional Ksh 250 for garbage.
7. Create a new Series for each step and finally combine them all into a DataFrame named 'Moringa_property'.

In [5]:
# Generate 120 random numbers between Ksh 4000 and Ksh 20,000 using numpy to represent the prices of the houses.
rand_houses = np.random.randint(4000, 20000, 120)
rand_houses

array([12404, 16037, 12495, 10199,  6996, 13736, 13530,  4984, 16729,
       19462, 14046, 17076, 18518, 16358, 16699,  8881,  4818,  6144,
       18506, 19238, 15096, 12963,  4844, 15328,  7586,  7652, 12095,
       18122, 12046,  9800,  9826,  8203,  6220, 10523,  6803, 11823,
       10297,  5058, 19266, 17098, 16716, 14107, 15756, 11759,  4128,
       12903,  4421,  7267,  7573,  9441, 10534,  9646, 16200, 16461,
       11101, 11890,  4982,  9617,  8210, 10539, 11077, 10556,  5334,
       19132, 13534, 15968, 10768,  9402, 19981, 11224,  9972,  4208,
        6866, 14371,  7589,  5828, 13812, 18645, 14719, 13753,  8421,
       12618, 10856, 17474, 11270, 14664,  4496, 14107, 19944, 13908,
        6397, 18633,  4605,  8167, 18140,  5033, 10038,  6402, 16100,
       15150,  4162, 13517,  5249, 12687, 14438, 16578,  5117,  7392,
       18916, 14073, 13929,  7676,  7382, 11282,  8561, 16553, 17049,
       18844,  7706,  9859])

In [6]:
available_houses = pd.Series(rand_houses, name="Available Houses")
available_houses

0      12404
1      16037
2      12495
3      10199
4       6996
       ...  
115    16553
116    17049
117    18844
118     7706
119     9859
Name: Available Houses, Length: 120, dtype: int32

In [7]:
# Display the first and last 7 houses.
available_houses.head(7)

0    12404
1    16037
2    12495
3    10199
4     6996
5    13736
6    13530
Name: Available Houses, dtype: int32

In [8]:
available_houses.tail(7)

113    11282
114     8561
115    16553
116    17049
117    18844
118     7706
119     9859
Name: Available Houses, dtype: int32

In [9]:
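# Create a function that takes in the price of a house and returns the category of that house, e.g. Suburb. The categories are of your own choosing.
# Note: the generated prices are all below 20,000, so every house lands in "Middle Class" with these thresholds.
def rent_value(x):
    if x <= 25000:
        return "Middle Class"
    elif x <= 40000:  # x > 25000 is already guaranteed by the branch above
        return "Suburb"
    else:
        return "House not Available"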

# rent_value = available_houses.apply(lambda x: 'Suburb' if (x >= 30000) else 'Middle Class')
    

In [10]:
# Apply the function created above to the series.
rent_series = available_houses.map(rent_value)

In [11]:
rent_series.rename("Rent Range", inplace=True)
rent_series.head(7)

0    Middle Class
1    Middle Class
2    Middle Class
3    Middle Class
4    Middle Class
5    Middle Class
6    Middle Class
Name: Rent Range, dtype: object
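
When the categories are simple price bands, `pd.cut` offers an alternative to a hand-written function. The sketch below is an assumption-laden illustration: the band edges and the "Budget" label are hypothetical, chosen only so that prices in the 4,000 to 20,000 range spread across more than one category.

```python
import pandas as pd

# Four of the generated prices shown above, used as a small sample
prices = pd.Series([4984, 12404, 16037, 19462], name="Available Houses")

# Hypothetical price bands; each bin includes its right edge by default
bins = [0, 8000, 15000, 20000]
labels = ["Budget", "Middle Class", "Suburb"]

# Assign each price to the band it falls into
categories = pd.cut(prices, bins=bins, labels=labels)

print(categories)
```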

In [12]:
# Apply a lambda function to increase the property prices by 10% due to the new tax laws.
new_rent = available_houses.apply(lambda r: r * 1.1)
new_rent.rename("Adjusted Rent", inplace = True)

0      13644.4
1      17640.7
2      13744.5
3      11218.9
4       7695.6
        ...   
115    18208.3
116    18753.9
117    20728.4
118     8476.6
119    10844.9
Name: Adjusted Rent, Length: 120, dtype: float64

In [13]:
new_rent.head(10)

0    13644.4
1    17640.7
2    13744.5
3    11218.9
4     7695.6
5    15109.6
6    14883.0
7     5482.4
8    18401.9
9    21408.2
Name: Adjusted Rent, dtype: float64

In [14]:
# Apply a custom function to increase the property prices by an additional Ksh 250 for garbage.
def garbage_fee(price):
    # price is a single value from the Series, not the whole Series
    return price + 250

In [15]:
garbage_series = new_rent.map(garbage_fee)
garbage_series.rename("Rent + Garbage", inplace=True)
garbage_series.head(7)

0    13894.4
1    17890.7
2    13994.5
3    11468.9
4     7945.6
5    15359.6
6    15133.0
Name: Rent + Garbage, dtype: float64

In [16]:
# Create a new Series for each step and finally combine them all into a DataFrame named 'Moringa_property'.
# Government intervention
gov_rent = available_houses.apply(lambda k: k * 0.6)
gov_rent.rename("Government Rates", inplace=True)
gov_rent.tail(10)

110     8357.4
111     4605.6
112     4429.2
113     6769.2
114     5136.6
115     9931.8
116    10229.4
117    11306.4
118     4623.6
119     5915.4
Name: Government Rates, dtype: float64

In [17]:
# Service charge of Ksh 500 introduced
def serv_charge(rent):
    # rent is a single value from the Series, not the whole Series
    return rent + 500

In [18]:
serv_rent = gov_rent.map(serv_charge)
serv_rent.rename("Goverment & Service", inplace= True)
serv_rent.tail(10)

110     8857.4
111     5105.6
112     4929.2
113     7269.2
114     5636.6
115    10431.8
116    10729.4
117    11806.4
118     5123.6
119     6415.4
Name: Goverment & Service, dtype: float64

In [19]:
Moringa_Property = pd.DataFrame({
    rent_series.name : rent_series,
    available_houses.name : available_houses,
    new_rent.name : new_rent,
    garbage_series.name : garbage_series,
    gov_rent.name : gov_rent,
    serv_rent.name : serv_rent
})

In [20]:
Moringa_Property.head(10)

|     | Rent Range | Available Houses | Adjusted Rent | Rent + Garbage | Government Rates | Goverment & Service |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | Middle Class | 12404 | 13644.4 | 13894.4 | 7442.4 | 7942.4 |
| 1 | Middle Class | 16037 | 17640.7 | 17890.7 | 9622.2 | 10122.2 |
| 2 | Middle Class | 12495 | 13744.5 | 13994.5 | 7497.0 | 7997.0 |
| 3 | Middle Class | 10199 | 11218.9 | 11468.9 | 6119.4 | 6619.4 |
| 4 | Middle Class | 6996 | 7695.6 | 7945.6 | 4197.6 | 4697.6 |
| 5 | Middle Class | 13736 | 15109.6 | 15359.6 | 8241.6 | 8741.6 |
| 6 | Middle Class | 13530 | 14883.0 | 15133.0 | 8118.0 | 8618.0 |
| 7 | Middle Class | 4984 | 5482.4 | 5732.4 | 2990.4 | 3490.4 |
| 8 | Middle Class | 16729 | 18401.9 | 18651.9 | 10037.4 | 10537.4 |
| 9 | Middle Class | 19462 | 21408.2 | 21658.2 | 11677.2 | 12177.2 |


In [21]:
Moringa_Property.tail(10)

|     | Rent Range | Available Houses | Adjusted Rent | Rent + Garbage | Government Rates | Goverment & Service |
| --- | --- | --- | --- | --- | --- | --- |
| 110 | Middle Class | 13929 | 15321.9 | 15571.9 | 8357.4 | 8857.4 |
| 111 | Middle Class | 7676 | 8443.6 | 8693.6 | 4605.6 | 5105.6 |
| 112 | Middle Class | 7382 | 8120.2 | 8370.2 | 4429.2 | 4929.2 |
| 113 | Middle Class | 11282 | 12410.2 | 12660.2 | 6769.2 | 7269.2 |
| 114 | Middle Class | 8561 | 9417.1 | 9667.1 | 5136.6 | 5636.6 |
| 115 | Middle Class | 16553 | 18208.3 | 18458.3 | 9931.8 | 10431.8 |
| 116 | Middle Class | 17049 | 18753.9 | 19003.9 | 10229.4 | 10729.4 |
| 117 | Middle Class | 18844 | 20728.4 | 20978.4 | 11306.4 | 11806.4 |
| 118 | Middle Class | 7706 | 8476.6 | 8726.6 | 4623.6 | 5123.6 |
| 119 | Middle Class | 9859 | 10844.9 | 11094.9 | 5915.4 | 6415.4 |
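
The objectives also mention exploring a DataFrame and selecting data from it. A brief sketch of a few common operations, using a two-column stand-in built from the first three rows above; the name `df` and the threshold 12450 are chosen only for illustration.

```python
import pandas as pd

# A small stand-in for Moringa_Property, using the first three rows shown above
df = pd.DataFrame({
    "Available Houses": [12404, 16037, 12495],
    "Adjusted Rent": [13644.4, 17640.7, 13744.5],
})

print(df.shape)             # tuple of (rows, columns)
print(df.columns.tolist())  # list of column names

# Selecting one column returns a Series
one_column = df["Adjusted Rent"]

# Selecting one row by its index label also returns a Series
one_row = df.loc[1]

# Boolean filtering works on DataFrames the same way it does on Series
expensive = df[df["Available Houses"] > 12450]

print(expensive)
```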
