
# Pandas
The name pandas derives from *panel data* (it is also read as a play on *Python data analysis*), and the library is one of the basic tools we will use for data analytics and data science. It is built on top of NumPy, so it is customary to import NumPy alongside it, although that is only required when we call NumPy functions directly.

Pandas is mainly used for easy computing with:
- tabular data;
- time series;
- matrices;
- records and statistics.


In [1]:
import pandas as pd
import numpy as np

## Data structures in Pandas
Pandas provides two library-specific data structures:
- Series - `pd.Series`, which are fairly similar to lists, but optimized and labelled with an index. These will often be the columns of a DataFrame;
- DataFrame - `pd.DataFrame`, which are presented as tabular data, similar to spreadsheets. Each column of a DataFrame is a Series.


In [87]:
scores = pd.Series([25, 73, 94, 20]) # example of a Series
scores

0    25
1    73
2    94
3    20
dtype: int64

In [88]:
type(scores) # using type we can check what we created

pandas.core.series.Series

In [89]:
df = pd.DataFrame({'A' : [1, 2, 3],
                   'B' : [4, 5, 6]})
df # when working with a single DataFrame it is common to call it df; including 'df' in longer names is a helpful convention

Unnamed: 0,A,B
0,1,4
1,2,5
2,3,6


In [90]:
type(df)

pandas.core.frame.DataFrame

Let's create a sample DataFrame and see what we can do with it. Most often we will create DataFrames from files (CSV, JSON, XML, etc.).

In [91]:
# A DataFrame is always two-dimensional, just like a matrix.
# When we create one from scratch, we do not have to spell out every value in every column: scalars such as those in columns 'A' and 'F' are repeated to match the length of the other columns
df = pd.DataFrame({
                    'A': 1.,
                    'B': pd.Timestamp('20130102'),
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': np.array([3] * 4, dtype='int32'),
                    'E': pd.Categorical(["test", "train", "test", "train"]),
                    'F': 'foo',
                    'G' : [False, True,True,False]
                   })
df

Unnamed: 0,A,B,C,D,E,F,G
0,1.0,2013-01-02,1.0,3,test,foo,False
1,1.0,2013-01-02,1.0,3,train,foo,True
2,1.0,2013-01-02,1.0,3,test,foo,True
3,1.0,2013-01-02,1.0,3,train,foo,False


In [92]:
# DataFrame is a class in Pandas, so it comes with its own methods and we rarely need to call standalone functions.
df.info() # this method summarizes all the columns (Series) in our DataFrame

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 0 to 3
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   A       4 non-null      float64       
 1   B       4 non-null      datetime64[ns]
 2   C       4 non-null      float32       
 3   D       4 non-null      int32         
 4   E       4 non-null      category      
 5   F       4 non-null      object        
 6   G       4 non-null      bool          
dtypes: bool(1), category(1), datetime64[ns](1), float32(1), float64(1), int32(1), object(1)
memory usage: 292.0+ bytes


`df.info()` gives us information on all columns and their data types, as well as the number of non-null values in each column (in NumPy and Pandas, missing values are usually shown as NaN).

This is very useful when we are starting our analysis, for a number of reasons:
1. Null/NaN values might throw off our analysis;
2. If a column contains many Null/NaN values, we may not be able to use it, we may need to contact our client or data provider, or we may need to find a way to fill the gaps;
3. Different data types need to be handled differently;
4. With non-numeric data types we might want to split them, extract part of the value, etc.;
5. For more advanced analysis and ML, we might want to transform the non-numeric data into numeric data.
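
The points about missing values can be sketched on a small made-up table. The column names `price` and `label` and all values are hypothetical, and filling with the mean is only one of several possible strategies, not a recommendation for every dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the column names and values are made up for illustration.
demo_df = pd.DataFrame({
    'price': [10.0, np.nan, 12.5, np.nan],
    'label': ['a', 'b', None, 'd'],
})

# Count missing values per column (None and NaN both count as missing).
missing_per_column = demo_df.isna().sum()
print(missing_per_column)

# One possible strategy: fill numeric gaps with the column mean.
filled_df = demo_df.copy()
filled_df['price'] = filled_df['price'].fillna(filled_df['price'].mean())

# Another: drop every row that still has a missing value.
complete_rows = filled_df.dropna()
print(complete_rows)
```

Whether to fill or drop depends on how much data is missing and on what the column means, so it is worth checking `df.info()` before deciding.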

In [93]:
df.shape # same as with numpy, this will show us shape

(4, 7)

In [94]:
df.size # and size, i.e. the total number of cells (rows * columns)

28

In [95]:
df.columns # this will show us all the column names

Index(['A', 'B', 'C', 'D', 'E', 'F', 'G'], dtype='object')

In [96]:
df.index # and this will show us the row index (the row labels)

Int64Index([0, 1, 2, 3], dtype='int64')

In [97]:
df.values # and values

array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo',
        False],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo',
        True],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo',
        True],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo',
        False]], dtype=object)
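
The pandas documentation recommends `DataFrame.to_numpy()` as the preferred way to obtain the same array that `.values` returns. A minimal sketch on a made-up, purely numeric frame (the name `demo_df` is hypothetical) shows that NumPy picks one common dtype for the whole array; with mixed columns like ours that common dtype is `object`.

```python
import pandas as pd

# Hypothetical frame with one integer and one float column.
demo_df = pd.DataFrame({'A': [1, 2, 3], 'B': [4.0, 5.0, 6.0]})

# Convert to a NumPy array; integers are upcast so that all cells share one dtype.
arr = demo_df.to_numpy()
print(arr.shape)
print(arr.dtype)
```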

## Importing a CSV file
Let's work on a mock real-life example of a DataFrame. First, we need to import the file. In notebook environments like Google Colab or Jupyter Notebook we need to remember to provide the right path to the file when we are importing it (as always, but in Colab the file usually has to be uploaded to the session or mounted from Google Drive first).

In [None]:
sales_df = pd.read_csv('dane_sprzedaz.csv', sep=',', encoding='utf-8')
sales_df
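
To try `pd.read_csv` without the file at hand, the sketch below reads from an in-memory string instead of a path. The two rows are made up, `demo_sales_df` is a hypothetical name, and the cleanup steps assume that dates are stored as day.month.year and that some numbers use a decimal comma; the real file may differ.

```python
import io
import pandas as pd

# Made-up two-row sample that imitates the layout of the sales file.
csv_text = (
    'DZIEN_DATA,SKL_ID,TOW_ID,SPRZ_NETTO,ZYSK_PROCENT,ZYSK_WART\n'
    '01.05.2022,1,1,416.0,9,"37,44"\n'
    '02.05.2022,1,1,,9,"30,15"\n'
)

# io.StringIO stands in for the file; a path would normally go here.
demo_sales_df = pd.read_csv(io.StringIO(csv_text), sep=',', encoding='utf-8')

# Assumption: dates are day.month.year, so we parse them explicitly.
demo_sales_df['DZIEN_DATA'] = pd.to_datetime(
    demo_sales_df['DZIEN_DATA'], format='%d.%m.%Y'
)

# Assumption: decimal commas, so we swap them for dots before converting.
demo_sales_df['ZYSK_WART'] = (
    demo_sales_df['ZYSK_WART'].str.replace(',', '.', regex=False).astype(float)
)

demo_sales_df.info()
```

The empty field in the second row is read as NaN, which is how missing values typically enter a DataFrame.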

In [None]:
# let's see how the data looks in here
sales_df.info()

As we can see, in the column named 'SPRZ_NETTO' we have two NaN values, but we will get to that later.

Let's try some of the methods we can use on a DataFrame.

In [None]:
# Several methods show us a given number of rows, each in a different way
sales_df.head(5) # .head() gives us the first n rows

In [None]:
sales_df.tail(5) # .tail() will give us the last n rows

In [None]:
sales_df.sample(5) # .sample() gives us n random rows from the DataFrame. This is good for a quick glimpse, but not best practice for sampling
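
If the same random rows are needed every time the notebook runs, `.sample()` accepts a `random_state` seed. A minimal sketch on a made-up frame (the name `demo_df` and the seed value are arbitrary choices):

```python
import pandas as pd

# Hypothetical frame with ten rows.
demo_df = pd.DataFrame({'value': range(10)})

# The same seed yields the same rows on every call.
first_draw = demo_df.sample(n=3, random_state=42)
second_draw = demo_df.sample(n=3, random_state=42)
print(first_draw.equals(second_draw))
```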

### Indexing and slicing 
Indexing and slicing in DataFrames combine ideas from NumPy arrays and dictionaries. We usually select columns by name, as with dictionary keys, sometimes adding conditions to filter rows.


In [100]:
sales_df.SKL_ID # this way we can get the whole column as a Series
# OR sales_df[['SKL_ID']] to display it as a DataFrame

0      1
1      1
2      1
3      1
4      1
      ..
130    3
131    3
132    3
133    3
134    3
Name: SKL_ID, Length: 135, dtype: int64

In [101]:
sales_df[['DZIEN_DATA', 'TOW_ID']] #this way we can call for more than one column

Unnamed: 0,DZIEN_DATA,TOW_ID
0,01.05.2022,1
1,02.05.2022,1
2,03.05.2022,1
3,04.05.2022,1
4,05.05.2022,1
...,...,...
130,11.05.2022,3
131,12.05.2022,3
132,13.05.2022,3
133,14.05.2022,3
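
The difference between single and double brackets can be checked with `type()`. This is a small sketch on a made-up frame; `demo_df` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical frame reusing two of the column names from the sales data.
demo_df = pd.DataFrame({'SKL_ID': [1, 1, 3], 'TOW_ID': [1, 2, 3]})

as_series = demo_df['SKL_ID']    # single brackets return a Series
as_frame = demo_df[['SKL_ID']]   # a list of names returns a DataFrame

print(type(as_series))
print(type(as_frame))
print(as_frame.shape)
```

Attribute access such as `demo_df.SKL_ID` is a shortcut for the single-bracket form, but it only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute.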


### .iloc and .loc
These two indexers allow us to slice through the DataFrame in two different ways. Both are useful for different tasks (.iloc tends to be used more often).

`.iloc` selects by integer position in the DataFrame (0-based), regardless of the index labels;

`.loc` selects by index label, regardless of where that label sits in the DataFrame.

In [124]:
# for this example let's use a new dataframe to better see the difference
b = pd.Series(np.round(np.random.uniform(0,1,10),2))
i = np.r_[0:10]
np.random.shuffle(i)
b.index = i
b

1    0.83
8    0.12
6    0.84
9    0.75
5    0.47
2    0.80
7    0.60
4    0.65
0    0.05
3    0.93
dtype: float64

In [125]:
b.iloc[2] # this gives us the third value by position (positions start at 0)

0.84

In [126]:
b.loc[2] # this will give us the value with the index '2'

0.8

In [127]:
b.iloc[2:8]

6    0.84
9    0.75
5    0.47
2    0.80
7    0.60
4    0.65
dtype: float64

In [128]:
b.loc[2:8] # .loc slices by label: label 2 sits after label 8 in this Series, so the slice is empty (the same applies to a DataFrame)

Series([], dtype: float64)
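
One more difference is worth seeing in isolation: an `.iloc` slice excludes its end position, while a `.loc` slice includes its end label. A minimal sketch with a made-up Series and string labels, so positions and labels cannot be confused:

```python
import pandas as pd

# Hypothetical Series with string labels.
s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

by_position = s.iloc[0:2]   # positions 0 and 1; the end position is excluded
by_label = s.loc['a':'c']   # labels 'a' through 'c'; the end label is included

print(by_position)
print(by_label)
```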

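Let's try it out on our sales_df. It is indexed with consecutive ascending numbers starting at 0, so looking up a single row with .loc or .iloc gives the same result. Slices still differ: .loc includes the end label, while .iloc excludes the end position.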


In [129]:
sales_df.iloc[0]

DZIEN_DATA      01.05.2022
SKL_ID                   1
TOW_ID                   1
SPRZ_NETTO           416.0
ZYSK_PROCENT             9
ZYSK_WART            37,44
Name: 0, dtype: object

## Sorting and conditionals




In [None]:
sales_df.sort_values(by='SKL_ID', ascending=False) 
# sort_values(by="column_name") sorts by that column; ascending=False makes the order descending
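
Sorting by several columns works the same way: `by` and `ascending` both accept lists. The sketch below uses a made-up frame (`demo_df` and its values are hypothetical) and also shows that `sort_values` returns a new DataFrame and leaves the original order untouched.

```python
import pandas as pd

# Hypothetical frame reusing two column names from the sales data.
demo_df = pd.DataFrame({
    'SKL_ID': [2, 1, 2, 1],
    'SPRZ_NETTO': [300.0, 150.0, 100.0, 400.0],
})

# Sort by store ascending, then by net sales descending within each store.
sorted_df = demo_df.sort_values(by=['SKL_ID', 'SPRZ_NETTO'], ascending=[True, False])

print(sorted_df)
print(demo_df.index.tolist())  # the original frame keeps its order
```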

In [None]:
sales_df[sales_df['SKL_ID'] == 1] # we can filter with one condition

In [None]:
sales_df[(sales_df['SKL_ID'] == 2) & (sales_df['TOW_ID'] == 3)] # or we can use more

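In NumPy and Pandas we use a different notation for element-wise logical operators than Python's `and`, `or` and `not`:
- **&** as AND
- **|** as OR
- **~** as NOT (at the beginning of the condition)

Each condition should be wrapped in parentheses, because these operators bind more tightly than comparisons such as `==`.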
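
Since the example above only uses `&`, here is a short sketch of `|` and `~` on a made-up frame (`demo_df` and its values are hypothetical). Each comparison is wrapped in parentheses so that it is evaluated before the logical operator.

```python
import pandas as pd

# Hypothetical frame reusing two column names from the sales data.
demo_df = pd.DataFrame({'SKL_ID': [1, 2, 3, 2], 'TOW_ID': [3, 3, 1, 2]})

# OR: rows from store 1 or with product 3.
either = demo_df[(demo_df['SKL_ID'] == 1) | (demo_df['TOW_ID'] == 3)]

# NOT: every row except those from store 2.
not_store_2 = demo_df[~(demo_df['SKL_ID'] == 2)]

print(either)
print(not_store_2)
```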

In [None]:
# .isin() is a good fit for string or object values, but can be used on any discrete values; with range(340, 380) it matches only whole-number values from 340 to 379
sales_df[sales_df['SPRZ_NETTO'].isin(range(340, 380))]
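
For a continuous numeric range, `Series.between()` may be a better fit than `.isin()`, because it also keeps values that fall between whole numbers. A sketch on made-up values (`demo_df` is hypothetical) that compares the two:

```python
import pandas as pd

# Hypothetical net sales values, including one non-integer inside the range.
demo_df = pd.DataFrame({'SPRZ_NETTO': [339.5, 340.0, 355.25, 379.0, 380.0]})

# isin matches only values equal to one of the integers 340..379.
with_isin = demo_df[demo_df['SPRZ_NETTO'].isin(range(340, 380))]

# between keeps everything from 340 to 379, both ends included by default.
with_between = demo_df[demo_df['SPRZ_NETTO'].between(340, 379)]

print(with_isin)
print(with_between)
```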

### Other tricks
In other words, anything that doesn't fit under a more specific heading.


In [None]:
# Transposition -> if we want to flip the DataFrame so the rows become columns and columns become rows
sales_df.T

In [None]:
# Resetting the index
# If we don't like the way the DataFrame is indexed, or we want to use the current 
# index as another column, we can use .reset_index() and it will create a new index from 0 to n-1 (in order)
b.reset_index() # if we add drop=True in the brackets, the old index will be discarded instead of kept as a column
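
The effect of `drop=True` is easiest to see side by side. The sketch uses a short made-up Series (`s` and its values are hypothetical) with a shuffled index.

```python
import pandas as pd

# Hypothetical Series with a shuffled integer index.
s = pd.Series([0.83, 0.12, 0.84], index=[1, 8, 6])

# Without drop: the old index becomes a column, so a Series turns into a DataFrame.
kept = s.reset_index()

# With drop=True: the old index is discarded and the result stays a Series.
dropped = s.reset_index(drop=True)

print(kept)
print(dropped)
```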

In [None]:
# Creating a copy - useful if we are making more changes to the DataFrame and don't want to overwrite the original
a = b.copy()
a
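
Why `.copy()` matters becomes clearer when it is compared with plain assignment, which only creates a second name for the same object. A minimal sketch with made-up data (all names here are hypothetical):

```python
import pandas as pd

# Hypothetical one-column frame.
original = pd.DataFrame({'A': [1, 2, 3]})

alias = original              # a second name for the same object, not a copy
independent = original.copy() # a separate object with its own data

alias.loc[0, 'A'] = 100       # this also changes `original`
independent.loc[1, 'A'] = -1  # this leaves `original` untouched

print(original)
print(independent)
```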

## Columns editing
The first step to working quickly with a DataFrame is getting to know the columns. If they have long names or spaces within them, or are not descriptive/mnemonic, we might want to change them. For this example we will change the names to something silly, purely for illustration.

We might also want to create columns with aggregated data or drop the ones that are not important or needed.

**! IMPORTANT !**
With these changes it's better to work on a copy if we don't want to change the original DataFrame.

In [None]:
sdf = sales_df.copy()
sdf.columns = ['A', 'B', 'C', 'D', 'E', 'F']
sdf

In [None]:
# If we just want to rename one of the columns
sdf = sdf.rename(columns={'E' : 'Column5'})
sdf

### Creating a new column

In [None]:
# this is the easiest way: we refer to a column that does not exist yet and assign it a value, similar to dictionaries
sdf['G'] = 1
sdf
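
A new column can also be computed from existing ones, and columns that are no longer needed can be dropped. The sketch below uses made-up rows and a hypothetical column name `ZYSK_WART_CALC`. The formula (profit value = net sales * profit percent / 100) is an assumption inferred from a single displayed row of the sales data, not a documented rule.

```python
import pandas as pd

# Hypothetical rows reusing two column names from the sales data.
demo_df = pd.DataFrame({
    'SPRZ_NETTO': [416.0, 200.0],
    'ZYSK_PROCENT': [9, 10],
})

# Assumed formula: profit value = net sales * profit percent / 100.
demo_df['ZYSK_WART_CALC'] = demo_df['SPRZ_NETTO'] * demo_df['ZYSK_PROCENT'] / 100

# drop returns a new DataFrame without the listed columns; demo_df is unchanged.
trimmed_df = demo_df.drop(columns=['ZYSK_PROCENT'])

print(demo_df)
print(trimmed_df)
```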