## Day 1

In [59]:
import pandas as pd
import numpy as np


### Series

A Series is a one-dimensional array of data. It is like a single column in a table.

A Series can hold different types of data, but be careful: if you want to do
calculations, it should contain only numerical data.

In [2]:
# Let's create a series
dogs_breed_series = pd.Series(['French poodle', 'Bulldog', 'Labrador retriever', 789, 'Dachshund'])

In [3]:
dogs_breed_series

0         French poodle
1               Bulldog
2    Labrador retriever
3                   789
4             Dachshund
dtype: object

In [4]:
dogs_weigth_series = pd.Series([3, 4, 67, 23, 39, 19, 10, 9])

In [5]:
dogs_weigth_series


0     3
1     4
2    67
3    23
4    39
5    19
6    10
7     9
dtype: int64

Notice that the dtype here is an integer (int64), while for the previous Series it was object (strings are seen as objects in pandas).
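One way to check whether a Series is ready for calculations is to look at its dtype and, if needed, convert it with `pd.to_numeric`. This is only a minimal sketch; the names `mixed` and `numeric` are illustrative and not part of the notebook.

```python
import pandas as pd

# A Series that mixes strings and a number is stored with dtype object
mixed = pd.Series(['French poodle', 'Bulldog', 789])
print(mixed.dtype)

# errors='coerce' turns values that cannot be parsed as numbers into NaN
numeric = pd.to_numeric(mixed, errors='coerce')
print(numeric)

# NaN values are skipped by default, so only 789 is averaged
print(numeric.mean())
```

Coercing silently discards the non-numeric entries, so it is worth checking how many NaN values appear before trusting the result.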

The indexes (0, 1, 2, 3, ...) are assigned automatically, but you can change that, like this:

In [6]:

# We can assign different indexes
dogs_weigth_serie = pd.Series(
    [3, 4, 67, 23, 39, 19, 10, 9],
    index = ['Cookie', 'Biscuit', 'Pepper', 'Apollo', 'Ginger', 'Ruby',
             "Spark", 'Peach'])

In [7]:
dogs_weigth_serie

Cookie      3
Biscuit     4
Pepper     67
Apollo     23
Ginger     39
Ruby       19
Spark      10
Peach       9
dtype: int64
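With custom indexes, values can be looked up by label. The sketch below uses a shortened copy of the weights Series (named `weights` here only for illustration) to show label-based and position-based access side by side.

```python
import pandas as pd

weights = pd.Series([3, 4, 67], index=['Cookie', 'Biscuit', 'Pepper'])

# Select one value by its label
print(weights['Pepper'])

# Select several labels at once; the result is again a Series
print(weights[['Cookie', 'Pepper']])

# Position-based access is still available through iloc
print(weights.iloc[0])
```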

## Statistics

Let's calculate the mean, median and mode.

These are functions to study central tendency.

In [8]:
dogs_weigth_serie.mean()

21.75

In [9]:
dogs_weigth_serie.median()

14.5

In [10]:
dogs_weigth_serie.mode()

0     3
1     4
2     9
3    10
4    19
5    23
6    39
7    67
dtype: int64
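The mode output above lists every weight because each value appears exactly once, so all of them tie for most frequent. A small sketch with a repeated value (the numbers are made up for illustration) shows the more typical case:

```python
import pandas as pd

weights = pd.Series([3, 4, 4, 9, 10])

# mode() returns a Series because several values can tie
modes = weights.mode()
print(modes)

# Take the first entry when a single value is needed
print(modes.iloc[0])
```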

In [11]:
# To study the spread, let's use the standard deviation and the variance
dogs_weigth_serie.std()

21.783020910791965

In [12]:
dogs_weigth_serie.var()

474.5
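The two results are related: the standard deviation is the square root of the variance. pandas uses the sample formulas (`ddof=1`) by default. The sketch below rebuilds the same weights as a plain Series (called `weights` here for illustration) to check this.

```python
import numpy as np
import pandas as pd

weights = pd.Series([3, 4, 67, 23, 39, 19, 10, 9])

# Default: sample variance and standard deviation (ddof=1)
sample_var = weights.var()
sample_std = weights.std()
print(np.isclose(sample_std ** 2, sample_var))

# ddof=0 divides by n instead of n - 1 (population variance)
population_var = weights.var(ddof=0)
print(sample_var, population_var)
```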

# Dataframes

Dataframes are two-dimensional structures (rows, cols).

They have operations that allow you to manipulate numerical tables and time series.

A dataframe looks like a spreadsheet in Excel, but it is more powerful.

There are multiple ways to define a dataframe.

In [13]:
# You can define the dataframe by defining each column (series)
d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data = d)

In [14]:
df, df.dtypes

(   col1  col2
 0     1     3
 1     2     4,
 col1    int64
 col2    int64
 dtype: object)

In [15]:
# You can define the type of all the columns at once
df = pd.DataFrame(data=d, dtype=np.int8)
df, df.dtypes

(   col1  col2
 0     1     3
 1     2     4,
 col1    int8
 col2    int8
 dtype: object)

In [16]:
# You can define it in the same way, but directly including a Series
d = {'col1': [0, 1, 2, 3], 'col2': pd.Series([2, 3], index=[2, 3])}
df = pd.DataFrame(data=d, index=[0, 1, 2, 3])
df

   col1  col2
0     0   NaN
1     1   NaN
2     2   2.0
3     3   3.0
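In the output above, the Series in `col2` is aligned on its index labels (2 and 3). Rows without a match are filled with NaN, which is why the column becomes float. A hedged sketch of one way to handle that follows; filling with 0 is an arbitrary choice for illustration.

```python
import pandas as pd

d = {'col1': [0, 1, 2, 3], 'col2': pd.Series([2, 3], index=[2, 3])}
df = pd.DataFrame(data=d, index=[0, 1, 2, 3])

# NaN is a float, so the whole column is stored as float64
print(df['col2'].dtype)

# Replace the missing values, then convert back to integers
df['col2'] = df['col2'].fillna(0).astype('int64')
print(df)
```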


In [17]:
# Building from a numpy ndarray
df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
    columns=['a', 'b', 'c'])
df2

   a  b  c
0  1  2  3
1  4  5  6
2  7  8  9


In [18]:
# But the more practical approach is to build it from a list of dicts
# This way each row keeps its own values together, and missing keys become NaN
df3 = pd.DataFrame.from_records(
    [{'points': 50, 'time': '5:00', 'year': 2010}, 
     {'points': 25, 'time': '6:00', 'month': "february", 'year': 2010}, 
     {'points':90, 'time': '9:00', 'month': 'january', 'year': 2020}, 
     {'points_h1':20, 'month': 'june', 'year': 2020}]
)
df3

   points  time  year     month  points_h1
0    50.0  5:00  2010       NaN        NaN
1    25.0  6:00  2010  february        NaN
2    90.0  9:00  2020   january        NaN
3     NaN   NaN  2020      june       20.0


In [19]:
# dtypes shows the type of each column
df3.dtypes

points       float64
time          object
year           int64
month         object
points_h1    float64
dtype: object

In [20]:
# columns shows the names of all the columns. Sometimes you have
# so many columns that you cannot see them all on one screen.
df3.columns

Index(['points', 'time', 'year', 'month', 'points_h1'], dtype='object')

In [21]:
# shape returns (number of rows, number of columns)
df3.shape

(4, 5)

In [22]:
# info shows the index, the columns, their non-null counts and their dtypes
df3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   points     3 non-null      float64
 1   time       3 non-null      object 
 2   year       4 non-null      int64  
 3   month      3 non-null      object 
 4   points_h1  1 non-null      float64
dtypes: float64(2), int64(1), object(2)
memory usage: 292.0+ bytes
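The non-null counts reported by `info()` can also be turned around to count missing values per column. A minimal sketch, rebuilding the same `df3` so the block runs on its own:

```python
import pandas as pd

df3 = pd.DataFrame.from_records(
    [{'points': 50, 'time': '5:00', 'year': 2010},
     {'points': 25, 'time': '6:00', 'month': 'february', 'year': 2010},
     {'points': 90, 'time': '9:00', 'month': 'january', 'year': 2020},
     {'points_h1': 20, 'month': 'june', 'year': 2020}]
)

# isna() marks missing cells with True; sum() counts them per column
missing_per_column = df3.isna().sum()
print(missing_per_column)
```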


In [23]:
# describe shows some statistics for the numeric columns
df3.describe()

          points      year  points_h1
count        3.0       4.0        1.0
mean        55.0    2015.0       20.0
std    32.787193  5.773503        NaN
min         25.0    2010.0       20.0
25%         37.5    2010.0       20.0
50%         50.0    2015.0       20.0
75%         70.0    2020.0       20.0
max         90.0    2020.0       20.0


In [24]:
# head displays the first rows, tail displays the last rows
df3.head(2)

   points  time  year     month  points_h1
0    50.0  5:00  2010       NaN        NaN
1    25.0  6:00  2010  february        NaN


In [25]:
df3.tail(3)

   points  time  year     month  points_h1
1    25.0  6:00  2010  february        NaN
2    90.0  9:00  2020   january        NaN
3     NaN   NaN  2020      june       20.0


In [26]:
df3[1:3]

   points  time  year     month  points_h1
1    25.0  6:00  2010  february        NaN
2    90.0  9:00  2020   january        NaN


In [27]:
# You can select data from one column
df3['points']

0    50.0
1    25.0
2    90.0
3     NaN
Name: points, dtype: float64

In [28]:
# from multiple columns
df3[['points', 'time']]

   points  time
0    50.0  5:00
1    25.0  6:00
2    90.0  9:00
3     NaN   NaN


In [29]:
# You can use some conditions to filter the dataframe
df3[df3['points'] < 55]


   points  time  year     month  points_h1
0    50.0  5:00  2010       NaN        NaN
1    25.0  6:00  2010  february        NaN


In [30]:
df3[(df3['points'] < 55) & (df3['time'] == '6:00')]

   points  time  year     month  points_h1
1    25.0  6:00  2010  february        NaN


In [31]:
df3[(df3['points'] < 55) | (df3['month'] == 'january')]

   points  time  year     month  points_h1
0    50.0  5:00  2010       NaN        NaN
1    25.0  6:00  2010  february        NaN
2    90.0  9:00  2020   january        NaN


In [32]:
# You can also use the isin function, which is useful for matching against a list of values
df3[(df3['points'] < 45) | (df3['month'].isin(['january', 'february']))]

   points  time  year     month  points_h1
1    25.0  6:00  2010  february        NaN
2    90.0  9:00  2020   january        NaN
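Boolean masks can also be negated with `~`, and `notna()` helps when a column has missing values. The sketch below rebuilds `df3` so it runs on its own; the result names `other_months` and `has_points` are illustrative only.

```python
import pandas as pd

df3 = pd.DataFrame.from_records(
    [{'points': 50, 'time': '5:00', 'year': 2010},
     {'points': 25, 'time': '6:00', 'month': 'february', 'year': 2010},
     {'points': 90, 'time': '9:00', 'month': 'january', 'year': 2020},
     {'points_h1': 20, 'month': 'june', 'year': 2020}]
)

# ~ negates a mask: rows whose month is NOT january or february.
# A missing month is not in the list, so that row is kept as well.
other_months = df3[~df3['month'].isin(['january', 'february'])]
print(other_months)

# notna() keeps only the rows where points has a value
has_points = df3[df3['points'].notna()]
print(has_points)
```

Each condition needs its own parentheses when combined with `&` or `|`, as in the cells above, because those operators bind more tightly than comparisons.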


**Exercise 1:** https://github.com/novillo-cs/softdev_material/blob/main/classwork/unit_7/01_pandas.md

## Day 2

In [33]:
# You can set a column as the index
df3 = df3.set_index('points')

In [34]:
df3

        time  year     month  points_h1
points
50.0    5:00  2010       NaN        NaN
25.0    6:00  2010  february        NaN
90.0    9:00  2020   january        NaN
NaN      NaN  2020      june       20.0


In [35]:
# filter by index
df3.loc[[25.0]]

        time  year     month  points_h1
points
25.0    6:00  2010  february        NaN


In [36]:
# filter by index and column
df3.loc[25.0, 'month']

'february'

In [37]:
# filter: more than one index
df3.loc[[25.0, 90.0]]

        time  year     month  points_h1
points
25.0    6:00  2010  february        NaN
90.0    9:00  2020   january        NaN


In [38]:
# selection by position with iloc
df3.iloc[0]

time         5:00
year         2010
month         NaN
points_h1     NaN
Name: 50.0, dtype: object

In [39]:
# iloc and column index
df3.iloc[0, 0]

'5:00'

In [40]:
# iloc and column name
df3.iloc[0].year

2010
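To summarize the difference: `loc` selects by index label and column name, while `iloc` selects by integer position. The following sketch uses a small hypothetical frame named `scores`, with the points used as index labels as in the cells above.

```python
import pandas as pd

scores = pd.DataFrame(
    {'time': ['5:00', '6:00', '9:00'], 'year': [2010, 2010, 2020]},
    index=[50.0, 25.0, 90.0],
)

# loc: the row whose index label is 25.0, column named 'year'
by_label = scores.loc[25.0, 'year']

# iloc: the second row and second column, counted from zero
by_position = scores.iloc[1, 1]

# iloc slices exclude the end position, like Python lists
first_two = scores.iloc[0:2]

print(by_label, by_position)
print(first_two)
```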

In [41]:
# You can reset the index if you need to
df3 = df3.reset_index()
df3


   points  time  year     month  points_h1
0    50.0  5:00  2010       NaN        NaN
1    25.0  6:00  2010  february        NaN
2    90.0  9:00  2020   january        NaN
3     NaN   NaN  2020      june       20.0


In [42]:
df3

   points  time  year     month  points_h1
0    50.0  5:00  2010       NaN        NaN
1    25.0  6:00  2010  february        NaN
2    90.0  9:00  2020   january        NaN
3     NaN   NaN  2020      june       20.0


### Sort

In [43]:
# You can sort your dataframe as you want. Sort by one column:
df3.sort_values('points')

   points  time  year     month  points_h1
1    25.0  6:00  2010  february        NaN
0    50.0  5:00  2010       NaN        NaN
2    90.0  9:00  2020   january        NaN
3     NaN   NaN  2020      june       20.0


In [44]:
# Multi-column sort
df3.sort_values(['year', 'month'])

   points  time  year     month  points_h1
1    25.0  6:00  2010  february        NaN
0    50.0  5:00  2010       NaN        NaN
2    90.0  9:00  2020   january        NaN
3     NaN   NaN  2020      june       20.0


In [45]:
# This time sorting by year ascending and by month descending
df3.sort_values(['year', 'month'], ascending=[True, False])

   points  time  year     month  points_h1
1    25.0  6:00  2010  february        NaN
0    50.0  5:00  2010       NaN        NaN
3     NaN   NaN  2020      june       20.0
2    90.0  9:00  2020   january        NaN
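In the sorted outputs above, rows with a missing value in the sort column end up last. The `na_position` parameter of `sort_values` controls this. A minimal sketch, rebuilding `df3` so it runs on its own:

```python
import pandas as pd

df3 = pd.DataFrame.from_records(
    [{'points': 50, 'time': '5:00', 'year': 2010},
     {'points': 25, 'time': '6:00', 'month': 'february', 'year': 2010},
     {'points': 90, 'time': '9:00', 'month': 'january', 'year': 2020},
     {'points_h1': 20, 'month': 'june', 'year': 2020}]
)

# Highest points first, with the row that has no points at the top
sorted_df = df3.sort_values('points', ascending=False, na_position='first')
print(sorted_df)
```

Note that `sort_values` returns a new dataframe; `df3` itself keeps its original order unless the result is assigned back.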


## Functions and Dataframes

In [46]:
df4 = pd.DataFrame.from_records([
    {'one_c': 1, "two_c": 2, "three_c": 3, "four_c": 4},
    {'one_c': 2, "two_c": 5, "three_c": 9, "four_c": 20},
    {'one_c': 3, "two_c": 6, "three_c": 12, "four_c": 100},
    {'one_c': 4, "two_c": 7, "three_c": 15, "four_c": 150},
])

In [47]:
df4

   one_c  two_c  three_c  four_c
0      1      2        3       4
1      2      5        9      20
2      3      6       12     100
3      4      7       15     150


In [48]:
# Let's see how to use some functions with a dataframe
# Let's create a function that multiplies two values together and by np.pi
def multiple_example(value1, value2):
    return value1 * value2 * np.pi

In [49]:
# Let's use this function with our dataframe
# This line will apply the function to all rows
df4['five_c'] = multiple_example(df4['one_c'], df4['two_c'])

In [50]:
df4

   one_c  two_c  three_c  four_c     five_c
0      1      2        3       4   6.283185
1      2      5        9      20  31.415927
2      3      6       12     100  56.548668
3      4      7       15     150  87.964594


Great! Column five_c has exactly what we want: calling the function with dataframe columns applied it to all rows.

This is perfect. We do not have to loop over each row, because pandas takes care of that.

In [51]:
# We could have done it directly without using a function
df4['six_c'] = df4['one_c'] * df4['two_c'] * np.pi

In [52]:
df4

   one_c  two_c  three_c  four_c     five_c      six_c
0      1      2        3       4   6.283185   6.283185
1      2      5        9      20  31.415927  31.415927
2      3      6       12     100  56.548668  56.548668
3      4      7       15     150  87.964594  87.964594


For simple things like the multiplication example, you do not need to create a function.

However, when you need to implement something using loops or conditions, the best option is to create a function.
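A function that contains an `if` cannot take whole columns the way `multiple_example` does, because the condition has to be evaluated row by row. One possible approach is `DataFrame.apply` with `axis=1`. The sketch below is only an illustration: the function `size_label` and its threshold of 100 are hypothetical, not part of the notebook.

```python
import numpy as np
import pandas as pd

df4 = pd.DataFrame.from_records([
    {'one_c': 1, 'two_c': 2, 'three_c': 3, 'four_c': 4},
    {'one_c': 2, 'two_c': 5, 'three_c': 9, 'four_c': 20},
    {'one_c': 3, 'two_c': 6, 'three_c': 12, 'four_c': 100},
    {'one_c': 4, 'two_c': 7, 'three_c': 15, 'four_c': 150},
])

def size_label(row):
    # Hypothetical rule: label a row by the value in four_c
    if row['four_c'] >= 100:
        return 'large'
    return 'small'

# axis=1 passes each row to the function, one at a time
df4['label'] = df4.apply(size_label, axis=1)

# For a simple two-way condition, np.where does the same on whole columns
df4['label_np'] = np.where(df4['four_c'] >= 100, 'large', 'small')
print(df4)
```

`apply` with `axis=1` calls the function once per row, so it is slower than column operations; it is a reasonable choice when the logic does not fit a column expression.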

### Group by

In [53]:
df3

   points  time  year     month  points_h1
0    50.0  5:00  2010       NaN        NaN
1    25.0  6:00  2010  february        NaN
2    90.0  9:00  2020   january        NaN
3     NaN   NaN  2020      june       20.0


In [54]:
df3.groupby(['year'])

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f213edd38f0>

In [55]:
df3.groupby(['year']).sum()

      points      time        month  points_h1
year
2010    75.0  5:006:00     february        0.0
2020    90.0      9:00  januaryjune       20.0


In [56]:
# Each column can use a different aggregation. You can exclude columns if you want.
df3.groupby(['year']).agg({'points': 'mean', 'time': 'first', 'month': 'last'})

      points  time     month
year
2010    37.5  5:00  february
2020    90.0  9:00      june
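`agg` also accepts keyword arguments of the form `new_name=(column, function)`, which lets you name the result columns and aggregate the same column more than once. A minimal sketch; the output names `mean_points`, `rows` and `non_null_points` are chosen here for illustration.

```python
import pandas as pd

df3 = pd.DataFrame.from_records(
    [{'points': 50, 'time': '5:00', 'year': 2010},
     {'points': 25, 'time': '6:00', 'month': 'february', 'year': 2010},
     {'points': 90, 'time': '9:00', 'month': 'january', 'year': 2020},
     {'points_h1': 20, 'month': 'june', 'year': 2020}]
)

summary = df3.groupby('year').agg(
    mean_points=('points', 'mean'),       # NaN values are ignored
    rows=('points', 'size'),              # counts all rows in the group
    non_null_points=('points', 'count'),  # counts only non-missing values
)
print(summary)
```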


In [61]:
grouped = df3.groupby(['year'])

Sometimes you may have to do something that cannot be done with the apply function.
For example, use multiprocessing or something that does not need pandas.
In that case you will need to use a loop to traverse the groupby result.

In [64]:
# You can traverse your groups and get some info
for group in grouped:
    print("group")
    print(type(group))
    print(len(group))
    print(group[0])
    print(group[1])
    print(type(group[1]))
    # And then you can call anything you want
    # You have group[0] which is the key
    # And group[1] which is the dataframe for the group[0] key
    

group
<class 'tuple'>
2
(2010,)
   points  time  year     month  points_h1
0    50.0  5:00  2010       NaN        NaN
1    25.0  6:00  2010  february        NaN
<class 'pandas.core.frame.DataFrame'>
group
<class 'tuple'>
2
(2020,)
   points  time  year    month  points_h1
2    90.0  9:00  2020  january        NaN
3     NaN   NaN  2020     june       20.0
<class 'pandas.core.frame.DataFrame'>
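Since each item is a `(key, dataframe)` tuple, the loop can unpack it directly instead of indexing with `[0]` and `[1]`. The sketch below groups by the column name `'year'` rather than the list `['year']`, so the key is a plain value instead of a one-element tuple; `rows_per_year` is an illustrative name.

```python
import pandas as pd

df3 = pd.DataFrame.from_records(
    [{'points': 50, 'time': '5:00', 'year': 2010},
     {'points': 25, 'time': '6:00', 'month': 'february', 'year': 2010},
     {'points': 90, 'time': '9:00', 'month': 'january', 'year': 2020},
     {'points_h1': 20, 'month': 'june', 'year': 2020}]
)

rows_per_year = {}
for year, group_df in df3.groupby('year'):
    # year is the group key, group_df is the dataframe for that key
    rows_per_year[int(year)] = len(group_df)

print(rows_per_year)
```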


## Rename columns

In [60]:
df3.rename(columns={"points": "Points", "time": "Time", "month": "Month"})

   Points  Time  year     Month  points_h1
0    50.0  5:00  2010       NaN        NaN
1    25.0  6:00  2010  february        NaN
2    90.0  9:00  2020   january        NaN
3     NaN   NaN  2020      june       20.0
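Keep in mind that `rename` returns a new dataframe and leaves the original untouched, so the result has to be assigned to a variable to keep it. A short sketch (the name `renamed` is illustrative):

```python
import pandas as pd

df3 = pd.DataFrame.from_records(
    [{'points': 50, 'time': '5:00', 'year': 2010},
     {'points': 25, 'time': '6:00', 'month': 'february', 'year': 2010},
     {'points': 90, 'time': '9:00', 'month': 'january', 'year': 2020},
     {'points_h1': 20, 'month': 'june', 'year': 2020}]
)

renamed = df3.rename(columns={'points': 'Points', 'time': 'Time', 'month': 'Month'})

print(list(df3.columns))      # the original names are unchanged
print(list(renamed.columns))  # the new names live in the returned dataframe
```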


## Read a CSV file

In [57]:
laptop_df = pd.read_csv('laptops.csv')

In [58]:
laptop_df

Unnamed: 0,brand,processor_brand,processor_name,processor_gnrtn,ram_gb,ram_type,ssd,hdd,os,os_bit,graphic_card_gb,weight,warranty,Touchscreen,msoffice,Price,rating,Number of Ratings,Number of Reviews
0,ASUS,Intel,Core i3,10th,4 GB,DDR4,0 GB,1024 GB,Windows,64-bit,0 GB,Casual,No warranty,No,No,34649,2.0,3,0
1,Lenovo,Intel,Core i3,10th,4 GB,DDR4,0 GB,1024 GB,Windows,64-bit,0 GB,Casual,No warranty,No,No,38999,3.0,65,5
2,Lenovo,Intel,Core i3,10th,4 GB,DDR4,0 GB,1024 GB,Windows,64-bit,0 GB,Casual,No warranty,No,No,39999,3.0,8,1
3,ASUS,Intel,Core i5,10th,8 GB,DDR4,512 GB,0 GB,Windows,32-bit,2 GB,Casual,No warranty,No,No,69990,3.0,0,0
4,ASUS,Intel,Celeron Dual,Not Available,4 GB,DDR4,0 GB,512 GB,Windows,64-bit,0 GB,Casual,No warranty,No,No,26990,3.0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
818,ASUS,AMD,Ryzen 9,Not Available,4 GB,DDR4,1024 GB,0 GB,Windows,64-bit,0 GB,Casual,1 year,No,No,135990,3.0,0,0
819,ASUS,AMD,Ryzen 9,Not Available,4 GB,DDR4,1024 GB,0 GB,Windows,64-bit,0 GB,Casual,1 year,No,No,144990,3.0,0,0
820,ASUS,AMD,Ryzen 9,Not Available,4 GB,DDR4,1024 GB,0 GB,Windows,64-bit,4 GB,Casual,1 year,No,No,149990,3.0,0,0
821,ASUS,AMD,Ryzen 9,Not Available,4 GB,DDR4,1024 GB,0 GB,Windows,64-bit,4 GB,Casual,1 year,No,No,142990,3.0,0,0
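Columns such as `ram_gb` are read as text (`4 GB`), so they usually need cleaning before any calculation. The sketch below is an assumption-laden illustration: it reads a tiny inline sample through `io.StringIO` instead of the real `laptops.csv`, keeping only three of the columns and the first four rows shown above.

```python
import io

import pandas as pd

# Inline stand-in for laptops.csv (first four rows, three columns only)
sample_csv = io.StringIO(
    "brand,ram_gb,Price\n"
    "ASUS,4 GB,34649\n"
    "Lenovo,4 GB,38999\n"
    "Lenovo,4 GB,39999\n"
    "ASUS,8 GB,69990\n"
)
laptops = pd.read_csv(sample_csv)

# Strip the unit and convert the text column to integers
laptops['ram_gb'] = laptops['ram_gb'].str.replace(' GB', '', regex=False).astype(int)

# Numeric columns can now be used in calculations
mean_price = laptops.groupby('brand')['Price'].mean()
print(laptops.dtypes)
print(mean_price)
```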


**Exercise 2:** https://github.com/novillo-cs/softdev_material/blob/main/classwork/unit_7/02_pandas