
# **PANDAS**

pandas is a fast, powerful, flexible, and easy-to-use open-source data analysis and manipulation tool, built on top of the Python programming language.

## Library Highlights
- A fast and efficient DataFrame object for data manipulation with integrated indexing;

- Tools for reading and writing data between in-memory data structures and different formats: CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format;

- Intelligent data alignment and integrated handling of missing data: gain automatic label-based alignment in computations and easily manipulate messy data into an orderly form;

- Flexible reshaping and pivoting of data sets;

- Intelligent label-based slicing, fancy indexing, and subsetting of large data sets;

- Columns can be inserted and deleted from data structures for size mutability;

- Aggregating or transforming data with a powerful group by engine allowing split-apply-combine operations on data sets;

- High performance merging and joining of data sets;

- Hierarchical axis indexing provides an intuitive way of working with high-dimensional data in a lower-dimensional data structure;

- Time-series functionality: date range generation and frequency conversion, moving window statistics, date shifting and lagging. You can even create domain-specific time offsets and join time series without losing data;

- Highly optimized for performance, with critical code paths written in Cython or C;

- Python with pandas is in use in a wide variety of academic and commercial domains, including Finance, Neuroscience, Economics, Statistics, Advertising, Web Analytics, and more.
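Two of the highlights above, label-based alignment with missing-data handling and split-apply-combine with group by, are easier to picture with a tiny example. The sketch below uses made-up values and illustrative names (`left`, `right`, `sales`); it is not taken from the pandas documentation.

```python
import pandas as pd

# Two Series whose labels only partly overlap (made-up values).
left = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
right = pd.Series([10.0, 20.0, 30.0], index=["b", "c", "d"])

# Addition aligns on labels; a label present on one side only gives NaN.
total = left + right

# fill_value=0 treats a missing label as 0 instead of producing NaN.
total_filled = left.add(right, fill_value=0)

# Split-apply-combine: group rows by a key, then aggregate each group.
sales = pd.DataFrame(
    {"region": ["north", "south", "north", "south"], "units": [5, 7, 3, 1]}
)
units_per_region = sales.groupby("region")["units"].sum()

print(total)
print(total_filled)
print(units_per_region)
```

The point to notice is that no positions are matched by hand: pandas lines values up by their labels, and the caller decides how missing labels are treated.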

### Mission
pandas aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open-source data analysis and manipulation tool available in any language.

### Vision
A world where data analytics and manipulation software is:

  - Accessible to everyone
  - Free for users to use and modify
  - Flexible
  - Powerful
  - Easy to use
  - Fast

# pandas

In [4]:
from IPython.display import HTML

# Use the direct URL to the GIF
HTML('<img src="https://media1.tenor.com/m/xrUe4KFY0dsAAAAC/brother-ew-ew.gif" width="300">')


In [None]:
# pandas is imported under its conventional alias: import pandas as pd

In [5]:
import pandas as pd

## Two main data structures
- Series: a one-dimensional labeled array that can hold any data type (integers, strings, Python objects, ...)
- DataFrame: a two-dimensional labeled table whose columns can each hold a different data type
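As a minimal sketch of how the two structures relate, the example below builds a Series and a DataFrame from made-up fruit data (the names `prices`, `fruit`, and `stock` are illustrative) and shows that selecting one column of a DataFrame gives back a Series.

```python
import pandas as pd

# A Series: one-dimensional values plus an index of labels.
prices = pd.Series([1.5, 2.0, 0.75], index=["apple", "pear", "plum"], name="price")

# A DataFrame: a table whose columns share one index.
fruit = pd.DataFrame(
    {"price": [1.5, 2.0, 0.75], "stock": [10, 0, 25]},
    index=["apple", "pear", "plum"],
)

# Selecting a single column gives back a Series.
stock = fruit["stock"]

print(prices.ndim, fruit.ndim)  # 1 2
print(fruit.shape)              # (3, 2)
print(type(stock).__name__)     # Series
```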


In [7]:
# series creation
import numpy as np
import pandas as pd

s = pd.Series([1, 3, 5, np.nan, 6, 8])
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
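The `np.nan` entry is why the integers above come back as `float64`: NaN is a floating-point value, so the whole Series is stored as floats. A short sketch of the usual ways to work with that missing value, using the same `s`:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 3, 5, np.nan, 6, 8])

missing_mask = s.isna()   # True only at position 3
without_nan = s.dropna()  # drops the missing entry, keeps the original labels
filled = s.fillna(0)      # replaces NaN with 0
mean_value = s.mean()     # NaN is skipped by default: (1 + 3 + 5 + 6 + 8) / 5

print(missing_mask.tolist())
print(without_nan.index.tolist())
print(filled.tolist())
print(mean_value)
```

None of these calls modify `s` itself; each returns a new object.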

In [14]:
# dataframe creation by passing a numpy array with a datetime index
dates = pd.date_range("20130101", periods=6)
dates

df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
df

                   A         B         C         D
2013-01-01 -2.018172 -0.255217  0.607589 -0.240345
2013-01-02 -1.820930  0.453782  0.541024  0.342343
2013-01-03  2.427937 -0.784962 -1.169947  0.183713
2013-01-04 -0.913897  0.342339 -0.061592 -1.090139
2013-01-05 -0.783093 -0.830761 -0.389726  0.510716
2013-01-06 -0.847126  0.226436 -0.612598 -0.548271
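Because `np.random.randn` draws new numbers on every call, the values in this frame change each time the cell is run. If reproducible output is wanted, one option is a seeded NumPy `Generator`. The sketch below assumes an arbitrary seed of 0, and `make_frame` is a hypothetical helper, not part of pandas.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)


def make_frame(seed):
    # A seeded Generator produces the same draws on every run.
    rng = np.random.default_rng(seed)
    return pd.DataFrame(
        rng.standard_normal((6, 4)), index=dates, columns=list("ABCD")
    )


first = make_frame(seed=0)
second = make_frame(seed=0)

print(first.equals(second))  # True: same seed, same values
```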


In [10]:
# creating a dataframe by passing a dictionary of objects where the keys are the column labels and the values are the column values.
df2 = pd.DataFrame(
    {
        "A": 1.0,
        "B": pd.Timestamp("20130102"),
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",
    }
)
df2

     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo
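Scalars such as `1.0` and `"foo"` are broadcast to every row, and each column keeps its own dtype. A quick way to confirm this is `df2.dtypes`; the sketch below also uses `select_dtypes` to pick out the numeric columns.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame(
    {
        "A": 1.0,
        "B": pd.Timestamp("20130102"),
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",
    }
)

# One dtype per column. The exact names printed for B (datetime) and F (text)
# depend on the pandas version, so they are not spelled out here.
print(df2.dtypes)

# Keep only the numeric columns; the datetime, categorical and text columns are excluded.
numeric_only = df2.select_dtypes(include="number")
print(numeric_only.columns.tolist())  # ['A', 'C', 'D']
```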


# VIEWING DATA

In [22]:
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))

df.head() # to view the top rows of the frame (5 by default)

                   A         B         C         D
2013-01-01 -1.255680  0.334417  0.417324  0.143811
2013-01-02  0.136952 -1.512522 -0.792019  0.512880
2013-01-03  1.454738 -0.319795  0.305296  0.108207
2013-01-04 -1.866470  0.053046  0.077662 -0.557932
2013-01-05 -1.764306  1.160243 -0.045909  0.717227


In [23]:
df.tail(3) # to view the bottom rows of the frame

                   A         B         C         D
2013-01-04 -1.866470  0.053046  0.077662 -0.557932
2013-01-05 -1.764306  1.160243 -0.045909  0.717227
2013-01-06  0.744758 -1.188196 -0.372905 -1.464942


In [27]:
df.index

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [26]:
df.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

In [28]:
# DataFrame.to_numpy() returns the underlying data as a NumPy array, without the index or column labels

df.to_numpy()

array([[-1.2556798 ,  0.33441732,  0.41732374,  0.14381117],
       [ 0.13695248, -1.51252156, -0.79201916,  0.51288042],
       [ 1.45473755, -0.3197954 ,  0.30529596,  0.10820749],
       [-1.86647006,  0.05304568,  0.07766188, -0.55793223],
       [-1.76430558,  1.16024266, -0.04590939,  0.71722652],
       [ 0.74475781, -1.1881961 , -0.37290537, -1.46494198]])
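The conversion is cheap here because every column of `df` is `float64`. A NumPy array has a single dtype for the whole array, so a frame with mixed column types is converted to a dtype that can hold everything, typically `object`. A small sketch with made-up frames (`floats` and `mixed` are illustrative names):

```python
import numpy as np
import pandas as pd

# All-float frame: the array keeps a single float64 dtype.
floats = pd.DataFrame({"A": [1.0, 2.0], "B": [3.0, 4.0]})
float_array = floats.to_numpy()

# Mixed frame: floats and strings have no common numeric dtype,
# so the resulting array uses the generic object dtype.
mixed = pd.DataFrame({"A": [1.0, 2.0], "F": ["foo", "bar"]})
mixed_array = mixed.to_numpy()

print(float_array.dtype)  # float64
print(mixed_array.dtype)  # object
```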

In [29]:
df.dtypes


A    float64
B    float64
C    float64
D    float64
dtype: object

In [31]:
df.describe() # shows a quick statistical summary of each column

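For reference, `describe()` returns one row per statistic (`count`, `mean`, `std`, `min`, the quartiles, `max`) and keeps the original columns as columns. The sketch below uses a small made-up frame, `small`, so the numbers are easy to check by hand.

```python
import pandas as pd

small = pd.DataFrame(
    {"x": [1.0, 2.0, 3.0, 4.0], "y": [10.0, 10.0, 20.0, 40.0]}
)
summary = small.describe()

# Rows of the summary are the statistics; columns are the original columns.
print(summary.index.tolist())
# ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']
print(summary.loc["mean", "x"])  # 2.5
print(summary.loc["50%", "y"])   # 15.0
```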


In [32]:
df.T # transposes the data (rows become columns and columns become rows)

   2013-01-01  2013-01-02  2013-01-03  2013-01-04  2013-01-05  2013-01-06
A   -1.255680    0.136952    1.454738   -1.866470   -1.764306    0.744758
B    0.334417   -1.512522   -0.319795    0.053046    1.160243   -1.188196
C    0.417324   -0.792019    0.305296    0.077662   -0.045909   -0.372905
D    0.143811    0.512880    0.108207   -0.557932    0.717227   -1.464942


In [34]:
df.sort_index(axis=1, ascending=False) # sorts by the column labels (axis=1), in descending order

                   D         C         B         A
2013-01-01  0.143811  0.417324  0.334417 -1.255680
2013-01-02  0.512880 -0.792019 -1.512522  0.136952
2013-01-03  0.108207  0.305296 -0.319795  1.454738
2013-01-04 -0.557932  0.077662  0.053046 -1.866470
2013-01-05  0.717227 -0.045909  1.160243 -1.764306
2013-01-06 -1.464942 -0.372905 -1.188196  0.744758


In [35]:
df.sort_values(by='B') # sorts the rows by the values in column B (ascending by default)

                   A         B         C         D
2013-01-02  0.136952 -1.512522 -0.792019  0.512880
2013-01-06  0.744758 -1.188196 -0.372905 -1.464942
2013-01-03  1.454738 -0.319795  0.305296  0.108207
2013-01-04 -1.866470  0.053046  0.077662 -0.557932
2013-01-01 -1.255680  0.334417  0.417324  0.143811
2013-01-05 -1.764306  1.160243 -0.045909  0.717227
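`sort_values` also accepts a sort direction and more than one key. The sketch below uses a small made-up frame, `scores`, to show both options; like the call above, it returns a new frame and leaves the original untouched.

```python
import pandas as pd

scores = pd.DataFrame(
    {"team": ["b", "a", "b", "a"], "points": [3, 7, 9, 1]},
    index=["r1", "r2", "r3", "r4"],
)

# Largest values first.
by_points_desc = scores.sort_values(by="points", ascending=False)

# Several keys: team ascending, then points descending within each team.
by_team_then_points = scores.sort_values(
    by=["team", "points"], ascending=[True, False]
)

print(by_points_desc.index.tolist())       # ['r3', 'r2', 'r1', 'r4']
print(by_team_then_points.index.tolist())  # ['r2', 'r4', 'r3', 'r1']
```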


In [None]:
# histogram of the values in column A

from matplotlib import pyplot as plt
df['A'].plot(kind='hist', bins=99, title='A')
# hide the top and right borders of the plot
plt.gca().spines[['top', 'right']].set_visible(False)