
# Pandas
The name Pandas derives from *panel data* (and is also read as a play on *Python data analysis*); it is one of the basic tools we will use for data analytics and data science. It is built on top of NumPy, so it is common to import NumPy alongside it, although that is only needed when we call NumPy functions directly.

Pandas is mainly used for easy computing with:
- tabular data;
- time series;
- matrices;
- records and statistics.


In [1]:
import pandas as pd
import numpy as np

## Data structures in Pandas
In Pandas we can distinguish two pandas-specific data structures:
- Series - `pd.Series`, which are fairly similar to lists, but optimized and labeled with an index. These will often be our columns in a:
- DataFrame - `pd.DataFrame`, which holds tabular data, similar to a spreadsheet. Each column of a DataFrame is a Series.


In [None]:
scores = pd.Series([25, 73, 94, 20]) # example of a Series
scores

In [None]:
type(scores) # using type we can check what we created

In [None]:
df = pd.DataFrame({'A' : [1, 2, 3],
                   'B' : [4, 5, 6]})
df # if we are working on one DataFrame at a time, it's common to call it df; including 'df' in the name is a good habit in any case

In [None]:
type(df)
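To make the link between the two structures concrete, the short sketch below pulls one column out of a small DataFrame and checks that it is a Series. The names `demo_df` and `col_a` are illustrative only and are not used elsewhere in this notebook.

```python
import pandas as pd

# A small DataFrame with two columns, mirroring the example above
demo_df = pd.DataFrame({'A': [1, 2, 3],
                        'B': [4, 5, 6]})

# Selecting a single column by name returns a pd.Series
col_a = demo_df['A']
print(type(col_a))

# The Series keeps the same index as the DataFrame it came from
print(col_a.index.equals(demo_df.index))
```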

Let's create a random DataFrame and see what we can do with it. Most often we will create DataFrames from files (CSV, JSON, XML, etc.).

In [None]:
# A DataFrame is always two-dimensional, just like a matrix.
# When we create one from scratch, not every column needs a full list of values: a scalar, as in columns 'A' and 'F', is repeated for every row
df = pd.DataFrame({
                    'A': 1.,
                    'B': pd.Timestamp('20130102'),
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': np.array([3] * 4, dtype='int32'),
                    'E': pd.Categorical(["test", "train", "test", "train"]),
                    'F': 'foo',
                    'G' : [False, True,True,False]
                   })
df
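A smaller sketch of the same idea may help isolate what happens to the scalar columns. It assumes a hypothetical `broadcast_df` with only three of the columns from the cell above.

```python
import pandas as pd

# Scalars ('A', 'F') are repeated to match the length of the list in 'G'
broadcast_df = pd.DataFrame({
    'A': 1.0,
    'F': 'foo',
    'G': [False, True, True, False],
})

print(broadcast_df.shape)          # number of rows and columns
print(broadcast_df['A'].tolist())  # the scalar repeated once per row
```

At least one column has to carry a length (a list, an array or a Series). If every value were a scalar, pandas would need an explicit `index` argument to know how many rows to build.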

In [None]:
# DataFrame is a class in Pandas, so it comes with its own methods and we do not need to call separate functions as often.
df.info() # with this method we are getting information about all the columns (Series) in our DataFrame

`df.info()` gives us information on all columns and their data types, as well as the number of non-null values in each (in NumPy and pandas, a missing value is usually represented as NaN).

This is very useful when we are starting our analysis, for a number of reasons:
1. Null/NaN values might throw off our analysis;
2. If a column contains many Null/NaN values, perhaps we cannot use it for our analysis, or we need to contact our client/data bank; we might also think of a way to fill the Nulls/NaNs;
3. With some data types we need to work differently;
4. With non-numeric data types we might want to split them, take something out, etc.;
5. For more advanced analysis and ML, we might want to transform the non-numeric data into numeric data.
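Points 1 and 2 can be sketched as follows. The data and the names `orders_df`, `missing_per_column` and `filled_df` are made up for illustration, and filling with the column mean is only one possible strategy, not a recommendation for every dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing prices
orders_df = pd.DataFrame({
    'item': ['a', 'b', 'c', 'd'],
    'price': [10.0, np.nan, 7.5, np.nan],
})

# Count the missing values in each column
missing_per_column = orders_df.isna().sum()
print(missing_per_column)

# One possible way to fill the gaps: use the mean of the known prices
filled_df = orders_df.fillna({'price': orders_df['price'].mean()})
print(filled_df)
```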

In [None]:
df.shape # same as with numpy, this will show us shape

In [22]:
df.size # and size

28

In [None]:
df.columns # this will show us all the column names

In [None]:
df.index # and this will show us available index

In [None]:
df.values # and values
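The attributes above are related to each other, which the following sketch tries to show on a hypothetical `grid_df`. It also uses `to_numpy()`, which the pandas documentation recommends over `.values` for getting the underlying NumPy array.

```python
import pandas as pd

grid_df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# shape is a (rows, columns) tuple; size is the total number of cells
n_rows, n_cols = grid_df.shape
print(grid_df.size == n_rows * n_cols)

# The data as a plain NumPy array, without index or column labels
values = grid_df.to_numpy()
print(values.shape)
```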

## Importing a CSV file
Let's work on a mock real-life example of a DataFrame. First, we need to import the file. In hosted or notebook environments like Google Colab or Jupyter Notebook we need to remember to provide the right path to the file when we are importing it (as always, but the paths there can look slightly different).

In [None]:
sales_df = pd.read_csv('dane_sprzedaz.csv', sep=',', encoding='utf-8')
sales_df
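If the file is not at hand, the behaviour of `pd.read_csv` can still be tried out on an in-memory string. This is a minimal sketch: the column names are borrowed from this notebook, but the values in `csv_text` are invented and do not come from `dane_sprzedaz.csv`.

```python
import io
import pandas as pd

# Hypothetical CSV content; the second data row has an empty SPRZ_NETTO field
csv_text = "SKL_ID,TOW_ID,SPRZ_NETTO\n1,100,12.5\n1,101,\n2,100,8.0\n"

# read_csv accepts a file path or any file-like object
mock_sales_df = pd.read_csv(io.StringIO(csv_text), sep=',')
print(mock_sales_df)

# Empty fields are read as NaN
print(mock_sales_df['SPRZ_NETTO'].isna().sum())
```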

In [None]:
# let's see how the data looks in here
sales_df.info()

As we can see, in the column named 'SPRZ_NETTO' we have two NaN values, but we will get to that later.

Let's try some of the methods we can use on a DataFrame.

In [None]:
# There are several methods that show us a given number of rows
sales_df.head(5) # .head() gives us the first n rows

In [None]:
sales_df.tail(5) # .tail() will give us the last n rows

In [None]:
sales_df.sample(5) # .sample() gives us n random rows from the DataFrame. This is good for a quick glimpse, but it is not a substitute for a proper sampling procedure
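Because `.sample()` draws different rows on each run, results can be hard to reproduce. One common way around this is the `random_state` parameter, sketched below on a hypothetical `numbers_df`; the seed value 42 is arbitrary.

```python
import pandas as pd

numbers_df = pd.DataFrame({'n': range(100)})

# The same seed gives the same rows on every call
first_draw = numbers_df.sample(5, random_state=42)
second_draw = numbers_df.sample(5, random_state=42)
print(first_draw.equals(second_draw))

# head() and tail() are deterministic: first and last rows in order
print(numbers_df.head(3)['n'].tolist())
print(numbers_df.tail(3)['n'].tolist())
```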

### Indexing and slicing 
Indexing and slicing in DataFrames combines ideas from NumPy matrices and dictionaries. We usually select data by column name, sometimes adding conditions to filter the rows.


In [None]:
sales_df.SKL_ID # this way we can get the whole column as a Series
# OR sales_df['SKL_ID']

In [None]:
sales_df[['DZIEN_DATA', 'TOW_ID']] #this way we can call for more than one column
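The sketch below contrasts the bracket forms and adds a condition used as a filter. It assumes a hypothetical `shop_df` with invented values; only the column names are taken from the sales data.

```python
import pandas as pd

shop_df = pd.DataFrame({
    'SKL_ID': [1, 1, 2, 3],
    'TOW_ID': [100, 101, 100, 102],
})

one_column = shop_df['SKL_ID']     # single brackets: a Series
as_frame = shop_df[['SKL_ID']]     # double brackets: a one-column DataFrame
print(type(one_column), type(as_frame))

# A boolean condition inside the brackets keeps only the matching rows
store_one = shop_df[shop_df['SKL_ID'] == 1]
print(store_one)
```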

### .iloc and .loc
These two indexers allow us to slice through the DataFrame in two different ways. Both are useful for different tasks.

`.iloc` selects by integer position in the DataFrame (0 is the first row, whatever its label);

`.loc` selects by index label.

In [28]:
# for this example let's use a new dataframe to better see the difference
b = pd.Series(np.round(np.random.uniform(0,1,10),2))
i = np.r_[0:10]
np.random.shuffle(i)
b.index = i
b

8    0.27
5    0.32
1    0.25
0    0.07
4    0.52
9    0.24
3    0.01
6    0.47
2    0.61
7    0.87
dtype: float64

In [30]:
b.iloc[2] # this gives us the third value by position, regardless of its label

0.25

In [31]:
b.loc[2] # this gives us the value whose index label is 2

0.61
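The same distinction applies to slices, with one extra detail: an `.iloc` slice excludes its end position, while a `.loc` slice includes its end label. The following sketch uses a hypothetical Series `shuffled` with four values and a non-sorted integer index.

```python
import pandas as pd

shuffled = pd.Series([0.27, 0.32, 0.25, 0.07], index=[8, 5, 1, 0])

# Positions 0 and 1 (the end position 2 is excluded)
by_position = shuffled.iloc[0:2]
print(by_position)

# From label 8 through label 1, both ends included
by_label = shuffled.loc[8:1]
print(by_label)
```

Slicing by label on an index that is not sorted works here because both labels exist and are unique in the index.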