<a href="https://colab.research.google.com/github/miladziekanowska/Data_Analytics/blob/main/Pandas_DataFrame.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pandas
The name Pandas derives from *panel data* (and doubles as a play on *Python data analysis*). It is one of the basic tools we will use for data analytics and data science. It is built on top of NumPy, so by convention we import NumPy alongside it whenever we start a project that involves Pandas.

Pandas is mainly used for easy computing with:
- tabular data;
- time series;
- matrices;
- records and statistics.


In [16]:
import pandas as pd
import numpy as np

## Data structures in Pandas
In Pandas we can distinguish two pandas-specific data structures:
- Series - `pd.Series`, which is fairly similar to a list, but optimized and labeled with an index. Series will often be our columns in a:
- DataFrame - `pd.DataFrame`, which holds tabular data, similar to a spreadsheet. Each column of a DataFrame is a Series.


In [17]:
scores = pd.Series([25, 73, 94, 20]) # example of a Series
scores

0    25
1    73
2    94
3    20
dtype: int64

In [18]:
type(scores) # using type we can check what we created

pandas.core.series.Series

In [19]:
df = pd.DataFrame({'A' : [1, 2, 3],
                   'B' : [4, 5, 6]})
df # when working with one DataFrame at a time it is common to call it df; including 'df' in the name of every DataFrame is a useful convention

Unnamed: 0,A,B
0,1,4
1,2,5
2,3,6


In [20]:
type(df)

pandas.core.frame.DataFrame
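
To connect the two structures, here is a minimal sketch (the frame `demo_df` and its values are made up for illustration) showing that selecting a single column from a DataFrame gives back a Series:

```python
import pandas as pd

# Hypothetical two-column frame, used only for illustration
demo_df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

col = demo_df['A']          # a single column comes back as a Series
print(type(col).__name__)   # Series
print(col.tolist())         # [1, 2, 3]
```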

Let's create a random data frame and see what we can do with it. Most often we will create dfs from files (csv, json, xml, etc.).

In [21]:
# A DataFrame is always two-dimensional, just like a matrix.
# When we create one from scratch, we do not have to give every column its full length: scalar values, as in columns 'A' and 'F', are repeated for every row
df = pd.DataFrame({
                    'A': 1.,
                    'B': pd.Timestamp('20130102'),
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': np.array([3] * 4, dtype='int32'),
                    'E': pd.Categorical(["test", "train", "test", "train"]),
                    'F': 'foo',
                    'G' : [False, True,True,False]
                   })
df

Unnamed: 0,A,B,C,D,E,F,G
0,1.0,2013-01-02,1.0,3,test,foo,False
1,1.0,2013-01-02,1.0,3,train,foo,True
2,1.0,2013-01-02,1.0,3,test,foo,True
3,1.0,2013-01-02,1.0,3,train,foo,False


In [22]:
# DataFrame is a class in Pandas, so it comes with its own methods and we rarely need to call standalone functions.
df.info() # with this method we are getting information about all the columns (Series) in our DataFrame

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 0 to 3
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   A       4 non-null      float64       
 1   B       4 non-null      datetime64[ns]
 2   C       4 non-null      float32       
 3   D       4 non-null      int32         
 4   E       4 non-null      category      
 5   F       4 non-null      object        
 6   G       4 non-null      bool          
dtypes: bool(1), category(1), datetime64[ns](1), float32(1), float64(1), int32(1), object(1)
memory usage: 292.0+ bytes


`df.info()` gives us information on all columns and their data types, as well as the number of non-null (in NumPy and Pandas terms: not NaN) values per column.

This is very useful when we are starting our analysis, for a number of reasons:
1. Null/NaN values might throw off our analysis;
2. If a column contains many Null/NaN values, perhaps we cannot use it for our analysis, or we need to contact our client/data bank; we might also think of a way to fill the Nulls/NaNs;
3. Some data types need to be handled differently;
4. With non-numeric data types we might want to split them, take something out, etc.;
5. For more advanced analysis and ML, we might want to transform the non-numeric data into numeric data.
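
As a rough sketch of the first two points, the number of missing values per column can also be counted directly. The frame `demo_df` and its column names are hypothetical and only serve the illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a few missing entries
demo_df = pd.DataFrame({
    'price': [10.0, np.nan, 12.5, np.nan],
    'city': ['Gdansk', 'Krakow', None, 'Poznan'],
})

# isna() marks missing cells with True; sum() counts the True values per column
missing_per_column = demo_df.isna().sum()
print(missing_per_column)
```

A column with a high count is a candidate for dropping, filling, or a question to whoever supplied the data.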

In [23]:
df.shape # same as with numpy, this will show us shape

(4, 7)

In [24]:
df.size # and size

28

In [25]:
df.columns # this will show us all the column names

Index(['A', 'B', 'C', 'D', 'E', 'F', 'G'], dtype='object')

In [26]:
df.index # and this will show us available index

Int64Index([0, 1, 2, 3], dtype='int64')

In [27]:
df.values # and values

array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo',
        False],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo',
        True],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo',
        True],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo',
        False]], dtype=object)

## Importing a csv file
Let's work on a mock real-life example of a data frame. First, we need to import the file. In notebook environments like Google Colab or Jupyter Notebook we need to remember to provide the right path to the file when we are importing it (as always, although how the file gets there differs slightly).

In [28]:
sales_df = pd.read_csv('dane_sprzedaz.csv', sep=',', encoding='utf-8')
sales_df

Unnamed: 0,DZIEN_DATA,SKL_ID,TOW_ID,SPRZ_NETTO,ZYSK_PROCENT,ZYSK_WART
0,01.05.2022,1,1,416.0,9,3744
1,02.05.2022,1,1,454.0,2,908
2,03.05.2022,1,1,392.0,13,5096
3,04.05.2022,1,1,498.0,9,4482
4,05.05.2022,1,1,341.0,15,5115
...,...,...,...,...,...,...
130,11.05.2022,3,3,374.0,7,2618
131,12.05.2022,3,3,390.0,4,156
132,13.05.2022,3,3,485.0,10,485
133,14.05.2022,3,3,479.0,3,1437


In [29]:
# let's see how the data looks in here
sales_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 135 entries, 0 to 134
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   DZIEN_DATA    135 non-null    object 
 1   SKL_ID        135 non-null    int64  
 2   TOW_ID        135 non-null    int64  
 3   SPRZ_NETTO    133 non-null    float64
 4   ZYSK_PROCENT  135 non-null    int64  
 5   ZYSK_WART     135 non-null    object 
dtypes: float64(1), int64(3), object(2)
memory usage: 6.5+ KB


As we can see, in the column named 'SPRZ_NETTO' we have two NaN values, but we will get to that later.
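
The `info()` output also lists `DZIEN_DATA` and `ZYSK_WART` as `object`, which means they are stored as text. A minimal sketch of converting them follows. It assumes the dates are written as day.month.year and that `ZYSK_WART` uses a decimal comma (for example `37,44`); the two rows in `demo` are made up to mimic that format and are not read from the file:

```python
import pandas as pd

# Hypothetical rows that mimic the assumed format of the sales file
demo = pd.DataFrame({
    'DZIEN_DATA': ['01.05.2022', '02.05.2022'],
    'ZYSK_WART': ['37,44', '9,08'],
})

# Parse day.month.year strings into real timestamps
demo['DZIEN_DATA'] = pd.to_datetime(demo['DZIEN_DATA'], format='%d.%m.%Y')

# Swap the decimal comma for a dot, then convert the strings to floats
demo['ZYSK_WART'] = (
    demo['ZYSK_WART'].str.replace(',', '.', regex=False).astype(float)
)

print(demo.dtypes)
```

Once the columns are numeric and datetime, they can be sorted, filtered and aggregated like any other numeric column.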

Let's try some of the methods we can use on a DataFrame.

In [30]:
# There are methods that will show us a given number of rows in a few manners
sales_df.head(5) # .head() gives us the first n rows

Unnamed: 0,DZIEN_DATA,SKL_ID,TOW_ID,SPRZ_NETTO,ZYSK_PROCENT,ZYSK_WART
0,01.05.2022,1,1,416.0,9,3744
1,02.05.2022,1,1,454.0,2,908
2,03.05.2022,1,1,392.0,13,5096
3,04.05.2022,1,1,498.0,9,4482
4,05.05.2022,1,1,341.0,15,5115


In [31]:
sales_df.tail(5) # .tail() will give us the last n rows

Unnamed: 0,DZIEN_DATA,SKL_ID,TOW_ID,SPRZ_NETTO,ZYSK_PROCENT,ZYSK_WART
130,11.05.2022,3,3,374.0,7,2618
131,12.05.2022,3,3,390.0,4,156
132,13.05.2022,3,3,485.0,10,485
133,14.05.2022,3,3,479.0,3,1437
134,15.05.2022,3,3,366.0,13,4758


In [32]:
sales_df.sample(5) # .sample() gives us n random rows from the DataFrame. It is good for a quick glimpse, but not best practice for sampling

Unnamed: 0,DZIEN_DATA,SKL_ID,TOW_ID,SPRZ_NETTO,ZYSK_PROCENT,ZYSK_WART
6,07.05.2022,1,1,443.0,7,3101
131,12.05.2022,3,3,390.0,4,156
123,04.05.2022,3,3,475.0,11,5225
124,05.05.2022,3,3,323.0,5,1615
40,11.05.2022,1,3,464.0,6,2784


### Indexing and slicing 
Indexing and slicing in DataFrames is a mix of NumPy matrix and dictionary syntax. We usually select by column name, as with dictionary keys, and sometimes add conditions to filter the rows.


In [33]:
sales_df.SKL_ID # this way we can get the whole column as a Series
# OR sales_df[['SKL_ID']] to display it as a DataFrame

0      1
1      1
2      1
3      1
4      1
      ..
130    3
131    3
132    3
133    3
134    3
Name: SKL_ID, Length: 135, dtype: int64

In [34]:
sales_df[['DZIEN_DATA', 'TOW_ID']] #this way we can call for more than one column

Unnamed: 0,DZIEN_DATA,TOW_ID
0,01.05.2022,1
1,02.05.2022,1
2,03.05.2022,1
3,04.05.2022,1
4,05.05.2022,1
...,...,...
130,11.05.2022,3
131,12.05.2022,3
132,13.05.2022,3
133,14.05.2022,3


### .iloc and .loc
These two indexers allow us to slice through the DataFrame in two different ways. Both are useful (.iloc is more often used) for different tasks.

`.iloc` selects by integer position in the DataFrame, counting from 0;

`.loc` selects by index label.

In [35]:
# for this example let's use a new Series with a shuffled index to better see the difference
b = pd.Series(np.round(np.random.uniform(0,1,10),2))
i = np.r_[0:10]
np.random.shuffle(i)
b.index = i
b

9    0.62
0    0.22
3    0.95
2    0.48
7    0.16
6    0.33
1    0.48
4    0.70
8    0.59
5    0.28
dtype: float64

In [36]:
b.iloc[2] # this will give us the third value by position

0.95

In [37]:
b.loc[2] # this will give us the value with the index '2'

0.48

In [38]:
b.iloc[2:8]

3    0.95
2    0.48
7    0.16
6    0.33
1    0.48
4    0.70
dtype: float64

In [39]:
b.loc[2:8] # label-based slice: from label 2 through label 8, both included. If the first label (2) appeared later in the Series than the second (8), .loc would return an empty Series or df

2    0.48
7    0.16
6    0.33
1    0.48
4    0.70
8    0.59
dtype: float64
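
The difference between the two slices can be condensed into a small sketch. The Series `s` and its values are made up for illustration; its index is unique but not sorted, like the one in `b`:

```python
import pandas as pd

# Hypothetical Series with a unique, unsorted integer index
s = pd.Series([10, 20, 30, 40], index=[3, 1, 2, 0])

# Positions 1 and 2; the end position is excluded
by_position = s.iloc[1:3]

# From label 1 through label 0; the end label is included
by_label = s.loc[1:0]

# Label 0 sits after label 1 in the Series, so nothing is selected
reversed_labels = s.loc[0:1]

print(by_position.tolist())      # [20, 30]
print(by_label.tolist())         # [20, 30, 40]
print(reversed_labels.tolist())  # []
```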

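Let's try it out on our sales_df. Its index is the ascending numbers 0, 1, 2, ..., so .loc and .iloc return the same row for a single index value (slices still differ: .loc includes the end label, .iloc excludes the end position).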


In [40]:
sales_df.iloc[0]

DZIEN_DATA      01.05.2022
SKL_ID                   1
TOW_ID                   1
SPRZ_NETTO           416.0
ZYSK_PROCENT             9
ZYSK_WART            37,44
Name: 0, dtype: object

## Sorting and conditionals




In [41]:
sales_df.sort_values(by='SKL_ID', ascending=False) 
# sort_values(by="column_name") sorts by that column; with ascending=False the order is descending

Unnamed: 0,DZIEN_DATA,SKL_ID,TOW_ID,SPRZ_NETTO,ZYSK_PROCENT,ZYSK_WART
134,15.05.2022,3,3,366.0,13,4758
100,11.05.2022,3,1,468.0,2,936
110,06.05.2022,3,2,378.0,4,1512
109,05.05.2022,3,2,443.0,8,3544
108,04.05.2022,3,2,386.0,6,2316
...,...,...,...,...,...,...
28,14.05.2022,1,2,442.0,7,3094
27,13.05.2022,1,2,364.0,12,4368
26,12.05.2022,1,2,409.0,6,2454
25,11.05.2022,1,2,312.0,5,156


In [42]:
sales_df[sales_df['SKL_ID'] == 1] # we can filter with one condition

Unnamed: 0,DZIEN_DATA,SKL_ID,TOW_ID,SPRZ_NETTO,ZYSK_PROCENT,ZYSK_WART
0,01.05.2022,1,1,416.0,9,3744
1,02.05.2022,1,1,454.0,2,908
2,03.05.2022,1,1,392.0,13,5096
3,04.05.2022,1,1,498.0,9,4482
4,05.05.2022,1,1,341.0,15,5115
5,06.05.2022,1,1,355.0,8,284
6,07.05.2022,1,1,443.0,7,3101
7,08.05.2022,1,1,384.0,12,4608
8,09.05.2022,1,1,327.0,5,1635
9,10.05.2022,1,1,317.0,2,634


In [43]:
sales_df[(sales_df['SKL_ID'] == 2) & (sales_df['TOW_ID'] == 3)] # or we can use more

Unnamed: 0,DZIEN_DATA,SKL_ID,TOW_ID,SPRZ_NETTO,ZYSK_PROCENT,ZYSK_WART
75,01.05.2022,2,3,317.0,12,3804
76,02.05.2022,2,3,412.0,9,3708
77,03.05.2022,2,3,368.0,15,552
78,04.05.2022,2,3,320.0,6,192
79,05.05.2022,2,3,318.0,7,2226
80,06.05.2022,2,3,442.0,13,5746
81,07.05.2022,2,3,373.0,15,5595
82,08.05.2022,2,3,350.0,3,105
83,09.05.2022,2,3,387.0,10,387
84,10.05.2022,2,3,369.0,4,1476


In NumPy and Pandas we use different notation for element-wise logical operators (each condition goes in its own parentheses):
- **&** as AND
- **|** as OR
- **~** as NOT (at the beginning of the condition)
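
A minimal sketch of all three operators on a made-up frame (the `store` and `item` columns are hypothetical stand-ins for `SKL_ID` and `TOW_ID`):

```python
import pandas as pd

# Hypothetical frame, used only for illustration
demo = pd.DataFrame({'store': [1, 1, 2, 2, 3], 'item': [1, 3, 3, 2, 3]})

# AND: both conditions must hold
both = demo[(demo['store'] == 2) & (demo['item'] == 3)]

# OR: at least one condition must hold
either = demo[(demo['store'] == 1) | (demo['item'] == 2)]

# NOT: rows where the condition does not hold
negated = demo[~(demo['store'] == 3)]

print(len(both), len(either), len(negated))  # 1 3 4
```

The parentheses matter because `&` and `|` bind more tightly than `==` in Python.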

In [44]:
# .isin() keeps the rows whose value is in the given collection. It is good for string or object values, but can be used on any discrete values (here: the whole numbers 340-379)
sales_df[sales_df['SPRZ_NETTO'].isin(range(340, 380))]

Unnamed: 0,DZIEN_DATA,SKL_ID,TOW_ID,SPRZ_NETTO,ZYSK_PROCENT,ZYSK_WART
4,05.05.2022,1,1,341.0,15,5115
5,06.05.2022,1,1,355.0,8,284
10,11.05.2022,1,1,347.0,9,3123
13,14.05.2022,1,1,373.0,4,1492
15,01.05.2022,1,2,377.0,2,754
20,06.05.2022,1,2,369.0,15,5535
27,13.05.2022,1,2,364.0,12,4368
29,15.05.2022,1,2,368.0,3,1104
30,01.05.2022,1,3,366.0,8,2928
50,06.05.2022,2,1,375.0,7,2625


In [45]:
# .str.contains() can be used together with regular expressions to display all rows whose value contains the search phrase
sales_df[sales_df.DZIEN_DATA.str.contains('13', regex=True, na=False)]

Unnamed: 0,DZIEN_DATA,SKL_ID,TOW_ID,SPRZ_NETTO,ZYSK_PROCENT,ZYSK_WART
12,13.05.2022,1,1,334.0,4,1336
27,13.05.2022,1,2,364.0,12,4368
42,13.05.2022,1,3,490.0,12,588
57,13.05.2022,2,1,381.0,6,2286
72,13.05.2022,2,2,355.0,4,142
87,13.05.2022,2,3,367.0,13,4771
102,13.05.2022,3,1,456.0,10,456
117,13.05.2022,3,2,319.0,5,1595
132,13.05.2022,3,3,485.0,10,485


### Other tricks
In other words, anything that does not fit under a more specific heading.


In [46]:
# Transposition -> if we want to flip the DataFrame so the rows become columns and columns become rows
sales_df.T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,125,126,127,128,129,130,131,132,133,134
DZIEN_DATA,01.05.2022,02.05.2022,03.05.2022,04.05.2022,05.05.2022,06.05.2022,07.05.2022,08.05.2022,09.05.2022,10.05.2022,...,06.05.2022,07.05.2022,08.05.2022,09.05.2022,10.05.2022,11.05.2022,12.05.2022,13.05.2022,14.05.2022,15.05.2022
SKL_ID,1,1,1,1,1,1,1,1,1,1,...,3,3,3,3,3,3,3,3,3,3
TOW_ID,1,1,1,1,1,1,1,1,1,1,...,3,3,3,3,3,3,3,3,3,3
SPRZ_NETTO,416.0,454.0,392.0,498.0,341.0,355.0,443.0,384.0,327.0,317.0,...,481.0,450.0,437.0,388.0,463.0,374.0,390.0,485.0,479.0,366.0
ZYSK_PROCENT,9,2,13,9,15,8,7,12,5,2,...,7,6,13,11,7,7,4,10,3,13
ZYSK_WART,3744,908,5096,4482,5115,284,3101,4608,1635,634,...,3367,27,5681,4268,3241,2618,156,485,1437,4758


In [47]:
# Resetting the index
# If we don't like the way the DataFrame is indexed, or we want to use the current 
# index as another column, we can use .reset_index() and it will create a new index from 0 to the number of rows minus 1 (ordered)
b.reset_index() # if we add drop=True in the brackets, the old index will be discarded instead of kept as a column

Unnamed: 0,index,0
0,9,0.62
1,0,0.22
2,3,0.95
3,2,0.48
4,7,0.16
5,6,0.33
6,1,0.48
7,4,0.7
8,8,0.59
9,5,0.28
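
A short sketch of both variants side by side, on a made-up Series with the same kind of shuffled index:

```python
import pandas as pd

# Hypothetical Series, used only for illustration
s = pd.Series([0.62, 0.22, 0.95], index=[9, 0, 3])

# Default: the result is a DataFrame and the old index becomes the 'index' column
kept = s.reset_index()

# drop=True: the result stays a Series and the old index is discarded
dropped = s.reset_index(drop=True)

print(kept['index'].tolist())  # [9, 0, 3]
print(dropped.index.tolist())  # [0, 1, 2]
```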


In [48]:
# Creating a copy - useful if we are making more changes and don't want to overwrite the original (b is a Series here; DataFrames work the same way)
a = b.copy()
a

9    0.62
0    0.22
3    0.95
2    0.48
7    0.16
6    0.33
1    0.48
4    0.70
8    0.59
5    0.28
dtype: float64

## Columns editing
The first step to working quickly with a DataFrame is getting to know the columns. If they have long names or spaces within them, or are not descriptive/mnemonic, we might want to change them. For this example we will change the names to something silly, just to show how it works.

We might also want to create columns with aggregated data or drop the ones that are not important or needed.

**! IMPORTANT !**
With these changes it's better to work on a copy if we don't want to change the DataFrame.

In [None]:
sdf = sales_df.copy()
sdf.columns = ['A', 'B', 'C', 'D', 'E', 'F']
sdf

In [None]:
# If we just want to rename one of the columns
sdf = sdf.rename(columns={'E' : 'Column5'})
sdf

Creating a new column

In [None]:
# this is the easiest way, so we call upon the non-existing column and assign it a value, similar to dictionaries
sdf['G'] = 1
sdf

Deleting a column

In [None]:
# .drop() can remove not only columns but also rows - axis=1 is for columns, axis=0 is for rows
sdf.drop(['A'], axis=1) # returns a new DataFrame; sdf itself is left unchanged
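
One detail that is easy to miss: `.drop()` returns a new DataFrame, so the result has to be assigned if we want to keep it. A minimal sketch on a made-up frame, using the `columns=` and `index=` keywords as an alternative to `axis`:

```python
import pandas as pd

# Hypothetical frame, used only for illustration
demo = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6]})

demo.drop(columns=['A'])         # result is discarded; demo is unchanged
print(list(demo.columns))        # ['A', 'B', 'C']

demo = demo.drop(columns=['A'])  # reassign to keep the result
print(list(demo.columns))        # ['B', 'C']

# Rows are dropped by their index label
without_first_row = demo.drop(index=[0])
print(without_first_row.index.tolist())  # [1]
```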

## NaN values
NaN values (None/empty values in DataFrames) can throw off the whole model and disrupt the analysis. Unfortunately, missing values are common in openly available data sets, and we need to take some action when they appear.

If we are working on data from our client, from a different team or from a data bank, it is best to identify these missing values and contact whoever provided the data for an explanation. Perhaps this is a mistake and we should expect new data, or we need to do something with these values - drop the records, or fill them in some way. Filling missing values is covered in another notebook, *Pandas_DataCleaning*, in this repository.

In [None]:
# let's create a new DataFrame with missing values
dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), 
                  index=dates, 
                  columns=list('ABCD'))
df['D'] = -5
df['E'] = [np.nan, -1,-2,-3, -4, -5]
df['F'] = [1,2, np.nan, np.nan,np.nan, np.nan]
df

In [None]:
#Let's see the missing values
pd.isna(df)

`pd.isna(df)` marks each cell with *True* where the value is NaN and *False* where a value is present.

In [None]:
df[df.isna().any(axis=1)] # this way we are filtering the DataFrame and displaying only the rows where a NaN value appears

In [None]:
df.dropna() # this method returns the DataFrame without the rows that contain any NaN value

In [None]:
df.dropna(how='all') # if we add how='all' in here, we will only drop the rows where all the values are NaN
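
Two further options of `dropna()` are sketched below on a made-up frame: `subset` limits the check to chosen columns, and `axis=1` drops columns instead of rows.

```python
import numpy as np
import pandas as pd

# Hypothetical frame, used only for illustration
demo = pd.DataFrame({
    'D': [-5, -5, -5],
    'E': [np.nan, -1.0, -2.0],
    'F': [1.0, 2.0, np.nan],
})

# Drop only the rows where column E is missing
only_e = demo.dropna(subset=['E'])

# Drop every column that contains at least one NaN
no_nan_cols = demo.dropna(axis=1)

print(only_e.index.tolist())      # [1, 2]
print(list(no_nan_cols.columns))  # ['D']
```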

## DataFrame manipulation and merging

In [None]:
# Again, let's create a new DataFrame for this case 
df = pd.DataFrame(np.random.randn(10, 4))
df

In [None]:
# We can split the DataFrame into pieces (list), where every element will be a smaller DataFrame
pieces = [df[:3], df[3:7], df[7:]]
pieces

In [None]:
# From a list, we can also easily create the DataFrame back
pd.concat(pieces)
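
A quick sketch to check that splitting and concatenating is lossless. The frame `demo` holds made-up integers so the result is reproducible:

```python
import numpy as np
import pandas as pd

# Hypothetical 10 x 2 frame with predictable values
demo = pd.DataFrame(np.arange(20).reshape(10, 2), columns=['x', 'y'])
pieces = [demo[:3], demo[3:7], demo[7:]]

# Concatenating the pieces in order restores the original frame
restored = pd.concat(pieces)
print(restored.equals(demo))  # True

# ignore_index=True discards the old row labels and numbers the rows from 0
reordered = pd.concat([pieces[2], pieces[0]], ignore_index=True)
print(reordered.index.tolist())  # [0, 1, 2, 3, 4, 5]
```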

Okay, now back to merging the DataFrames.

In [84]:
# Let's create two more DataFrames
df1 = pd.DataFrame({'a': ['foo', 'bar'], 'b': [1, 2]})
df2 = pd.DataFrame({'c': ['foo', 'baz'], 'b': [2, 4]})

We can distinguish four main types of merges (or joins in SQL, where these types work the same way but with different syntax).
![Join types](https://www.metabase.com/learn/images/sql-join-types/join-types.png)


In [None]:
df1.merge(df2, how='left', on='b') # Left outer join

In [None]:
df1.merge(df2, how='right', on='b') # Right outer join

In [None]:
df1.merge(df2, how='inner', on='b') # Inner join

In [None]:
df1.merge(df2, how='outer', on='b') # Full outer join
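
To see which rows each join keeps, the four merges can be run in a loop with `indicator=True`, which adds a `_merge` column naming the source of every row. This is only a sketch; it re-creates `df1` and `df2` so it runs on its own, and `keys_by_join` is a helper name introduced here:

```python
import pandas as pd

df1 = pd.DataFrame({'a': ['foo', 'bar'], 'b': [1, 2]})
df2 = pd.DataFrame({'c': ['foo', 'baz'], 'b': [2, 4]})

keys_by_join = {}
for how in ['left', 'right', 'inner', 'outer']:
    merged = df1.merge(df2, how=how, on='b', indicator=True)
    # Remember which values of the key column 'b' survive each join
    keys_by_join[how] = sorted(merged['b'].tolist())
    print(how, keys_by_join[how], merged['_merge'].astype(str).tolist())
```

Only `b == 2` exists in both frames, so the inner join returns a single row, while the outer join returns all three key values.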

## Grouping the data in DataFrame
Grouping can be useful when we want to organize the data in another way, and it is often used for statistics.

In [None]:
# You know what it is - another DataFrame for this example
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three'],
                   'C': np.random.randn(8),
                   'D': np.random.randn(8)})
df

In [92]:
df.groupby('A') # If we just group by a column, we only get back a GroupBy object, not a table
# To see grouped results, we need to aggregate the data in some way

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f30e21a6b20>

In [None]:
# Let's do it with sum()
df.groupby('A').sum() # This works

In [None]:
# Who said we can only group by one column
df2 = df.groupby(['A', 'B']).sum()
df2
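
Several aggregations can also be computed in one go with `.agg()`. A minimal sketch on made-up, fixed values (instead of random ones) so the numbers are reproducible:

```python
import pandas as pd

# Hypothetical frame, used only for illustration
demo = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar'],
    'C': [1.0, 2.0, 3.0, 4.0],
})

# One row per group, one column per aggregation
summary = demo.groupby('A')['C'].agg(['sum', 'mean', 'count'])
print(summary)
```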

## Data aggregations
Let's check some simple statistics with DataFrames.

In [105]:
df = pd.DataFrame(np.random.randn(6, 4), 
                  index=dates, 
                  columns=list('ABCD'))
df

Unnamed: 0,A,B,C,D
2013-01-01,1.240183,1.053066,0.120661,1.253594
2013-01-02,0.010859,0.607829,0.516303,-0.243887
2013-01-03,0.164969,1.308236,3.094193,-1.004441
2013-01-04,0.669995,-0.75398,-0.849054,-0.808687
2013-01-05,1.220391,1.789693,-0.605187,-0.535197
2013-01-06,0.70453,-0.209953,-0.211632,1.46217


In [106]:
df.sum() # sums up all the numeric values in each column

A    4.010927
B    3.794891
C    2.065284
D    0.123552
dtype: float64

In [107]:
sales_df.sum() # for non-numeric data it will return some weird values (strings get concatenated)...

DZIEN_DATA      01.05.202202.05.202203.05.202204.05.202205.05....
SKL_ID                                                        270
TOW_ID                                                        270
SPRZ_NETTO                                                52864.0
ZYSK_PROCENT                                                 1034
ZYSK_WART       37,449,0850,9644,8251,1528,431,0146,0816,356,3...
dtype: object

In [111]:
sales_df.sum(numeric_only=True) # And now it is much clearer :)

SKL_ID            270.0
TOW_ID            270.0
SPRZ_NETTO      52864.0
ZYSK_PROCENT     1034.0
dtype: float64

In [112]:
df.min() # returns the smallest value in each column

A    0.010859
B   -0.753980
C   -0.849054
D   -1.004441
dtype: float64

In [113]:
df.max() # returns the largest value in each column

A    1.240183
B    1.789693
C    3.094193
D    1.462170
dtype: float64

In [114]:
df.mean() # returns the mean of the columns

A    0.668488
B    0.632482
C    0.344214
D    0.020592
dtype: float64

In [115]:
df.median() # returns the median of each column

A    0.687263
B    0.830448
C   -0.045486
D   -0.389542
dtype: float64

In [117]:
df.std() # returns the standard deviation of the columns

A    0.513564
B    0.959714
C    1.433610
D    1.069176
dtype: float64

That would be all for this notebook. Next up - Pandas Data Cleaning and preparation! 