## Introducing Pandas data structures: Series, DataFrame and Index objects

Pandas is a library built on NumPy for data manipulation and analysis. Unlike plain NumPy arrays, its data structures can be indexed with descriptive labels as well as with integers.

Series, DataFrame, and Index are the basic data structures in this library.

A Series is a one-dimensional labeled array whose elements share a single dtype (values of mixed Python types are stored with dtype object). It is similar to a NumPy array, but it can be indexed with descriptive labels as well as with integer positions.

In [2]:
import pandas as pd
days=pd.Series(['Mon', 'Tue', 'Wen'])
print(days)

0    Mon
1    Tue
2    Wen
dtype: object


In [4]:
#Creating series with a numpy array
import numpy as np
day_list=np.array(['Mon', 'Tue', 'Wen'])
numpy_days=pd.Series(day_list)
print(numpy_days)

0    Mon
1    Tue
2    Wen
dtype: object


In [6]:
days=pd.Series(['Mon', 'Tue', 'Wen'], index=['a', 'b', 'c'])
print(days)

a    Mon
b    Tue
c    Wen
dtype: object


#### Series elements can be accessed by position or by index label, as shown below

In [7]:
days.iloc[0]  # positional access; plain days[0] is deprecated when the index holds labels

'Mon'

In [8]:
days[1:]

b    Tue
c    Wen
dtype: object

In [9]:
days['c']

'Wen'
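
The difference between position-based and label-based access can be made explicit with the iloc and loc accessors. The following is a minimal sketch that reuses the labeled days Series from the cell above; the variable names are illustrative only.

```python
import pandas as pd

# Same labeled Series as in the cell above
days = pd.Series(['Mon', 'Tue', 'Wen'], index=['a', 'b', 'c'])

first_by_position = days.iloc[0]   # lookup by integer position
last_by_label = days.loc['c']      # lookup by index label

# Label slices include the end label; position slices exclude the end position
label_slice = days.loc['a':'b']
position_slice = days.iloc[0:2]

print(first_by_position, last_by_label)
print(label_slice.tolist(), position_slice.tolist())
```

Both slices return the same two elements here, but only because the label slice 'a':'b' is inclusive at both ends.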

### DataFrames

A DataFrame is a two-dimensional table whose columns are Series that share the same index.

It holds data in rows and columns just like a spreadsheet.

A DataFrame can be created from Series, dictionaries, lists, NumPy arrays, or other DataFrames.

In [10]:
print(pd.DataFrame())

Empty DataFrame
Columns: []
Index: []


In [12]:
# Create a dataframe from a dictionary
df_dict = {'Country' : ['Ghana', 'Kenya', 'Nigeria', 'Togo'],
          'capital' : ['Accra', 'Nairobi', 'Abuja', 'Lone'],
          'Population' : [285849, 7493494, 9383783, 998370],
          'Age' : [20,68,67,23]}
df = pd.DataFrame(df_dict, index=[2,4,6,8])
df

   Country  capital  Population  Age
2    Ghana    Accra      285849   20
4    Kenya  Nairobi     7493494   68
6  Nigeria    Abuja     9383783   67
8     Togo     Lone      998370   23
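
Besides a dictionary of lists, the other inputs mentioned earlier work in a similar way. The sketch below is illustrative only: the small datasets and the column names x and y are made up for the example.

```python
import numpy as np
import pandas as pd

# From a list of row dictionaries: the keys become column names
rows = [{'Country': 'Ghana', 'capital': 'Accra'},
        {'Country': 'Kenya', 'capital': 'Nairobi'}]
df_from_rows = pd.DataFrame(rows)

# From a 2-D NumPy array: column names are supplied explicitly
# ('x' and 'y' are placeholder names for this example)
df_from_array = pd.DataFrame(np.array([[20, 68], [67, 23]]), columns=['x', 'y'])

# From Series that share an index: each Series becomes a column
ages = pd.Series([20, 68], index=['Ghana', 'Kenya'])
capitals = pd.Series(['Accra', 'Nairobi'], index=['Ghana', 'Kenya'])
df_from_series = pd.DataFrame({'Age': ages, 'capital': capitals})

print(df_from_rows)
print(df_from_array)
print(df_from_series)
```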


at, iat, iloc and loc are accessors used to retrieve data from a DataFrame.

iloc selects rows and columns by integer position, while loc selects rows and columns by label.

at and iat retrieve a single value: at uses the row and column labels, and iat uses the integer row and column positions.

In [13]:
df.iloc[3]

Country         Togo
capital         Lone
Population    998370
Age               23
Name: 8, dtype: object

In [15]:
df.loc[6]

Country       Nigeria
capital         Abuja
Population    9383783
Age                67
Name: 6, dtype: object

In [16]:
df['capital']

2      Accra
4    Nairobi
6      Abuja
8       Lone
Name: capital, dtype: object

In [20]:
df.at[6, 'Country']

'Nigeria'

In [21]:
df.iat[2,1]

'Abuja'
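
loc and iloc also accept a row selector and a column selector together, which the single-argument calls above do not show. This is a minimal sketch that rebuilds the same df so it runs on its own; the result variable names are illustrative.

```python
import pandas as pd

# Same DataFrame as in the cell above
df = pd.DataFrame({'Country': ['Ghana', 'Kenya', 'Nigeria', 'Togo'],
                   'capital': ['Accra', 'Nairobi', 'Abuja', 'Lone'],
                   'Population': [285849, 7493494, 9383783, 998370],
                   'Age': [20, 68, 67, 23]},
                  index=[2, 4, 6, 8])

# loc: rows labeled 2 and 6, columns labeled 'Country' and 'Age'
by_label = df.loc[[2, 6], ['Country', 'Age']]

# iloc: first two rows, second and third columns (end positions excluded)
by_position = df.iloc[0:2, 1:3]

# at / iat: one value by labels or by integer positions
single_by_label = df.at[8, 'Age']
single_by_position = df.iat[0, 0]

print(by_label)
print(by_position)
print(single_by_label, single_by_position)
```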

Finally, an Index in pandas is an immutable array of labels. It behaves much like an ordered set, supporting operations such as union and intersection, although it may contain repeated values. Indexes are used to retrieve data from a DataFrame and to align data across multiple DataFrames.
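
The immutability and set-like behaviour of an Index can be checked directly. The sketch below uses two made-up Index objects, left and right, which are not part of the earlier cells.

```python
import pandas as pd

left = pd.Index([2, 4, 6, 8])
right = pd.Index([6, 8, 10])

# Assigning to an element of an Index raises TypeError
try:
    left[0] = 99
    mutated = True
except TypeError:
    mutated = False

# Set-style operations return new Index objects
common = left.intersection(right)
combined = left.union(right)

print(mutated)
print(common.tolist(), combined.tolist())
```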

Important Pandas functionality includes indexing, reindexing, selection, grouping, dropping entries, ranking, sorting, handling duplicates, and hierarchical indexing.
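
A few of these operations are sketched below on a reduced version of the country table. The Region column is a hypothetical addition made only so that grouping and hierarchical indexing have something to work with.

```python
import pandas as pd

df = pd.DataFrame({'Country': ['Ghana', 'Kenya', 'Nigeria', 'Togo'],
                   'Region': ['West', 'East', 'West', 'West'],  # hypothetical column
                   'Age': [20, 68, 67, 23]},
                  index=[2, 4, 6, 8])

sorted_df = df.sort_values('Age', ascending=False)       # sorting
age_rank = df['Age'].rank()                              # ranking (1 = smallest)
mean_age_by_region = df.groupby('Region')['Age'].mean()  # grouping
without_age = df.drop(columns='Age')                     # dropping a column
reindexed = df.reindex([2, 4, 10])                       # label 10 is new, so its row is NaN
has_duplicates = df.duplicated().any()                   # duplicate detection

# Hierarchical indexing: two index levels, sorted for efficient lookup
hier = df.set_index(['Region', 'Country']).sort_index()
west_rows = hier.loc['West']

print(sorted_df)
print(mean_age_by_region)
print(west_rows)
```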

### Summary and descriptive statistics

Like NumPy, Pandas provides methods for descriptive statistics, covering measures of central tendency, dispersion, skewness and kurtosis, and correlation (which can help in detecting multicollinearity).

These methods include mode(), median(), mean(), sum(), std(), var(), skew(), kurt(), min() and corr().

The describe() function summarizes the numeric columns of a DataFrame, displaying the count, mean, standard deviation, minimum, quartiles (25%, 50% and 75%), and maximum.



In [23]:
df['Population'].sum()

18161496

In [24]:
df.mean(numeric_only=True)  # restrict to numeric columns; the text columns cannot be averaged

Population    4540374.0
Age                44.5
dtype: float64

In [25]:
df.describe()

       Population        Age
count         4.0   4.0
mean    4540374.0  44.5
std     4576254.0  26.589472
min      285849.0  20.0
25%      820239.8  22.25
50%     4245932.0  45.0
75%     7966066.0  67.25
max     9383783.0  68.0
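
The remaining statistics can be computed in the same way. The following sketch uses only the numeric columns of the table above; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Population': [285849, 7493494, 9383783, 998370],
                   'Age': [20, 68, 67, 23]},
                  index=[2, 4, 6, 8])

median_age = df['Age'].median()                # central tendency
age_std = df['Age'].std()                      # dispersion (sample standard deviation)
age_range = df['Age'].max() - df['Age'].min()  # dispersion (range)
population_skew = df['Population'].skew()      # asymmetry of the distribution
age_kurt = df['Age'].kurt()                    # tail weight of the distribution
correlations = df.corr()                       # pairwise Pearson correlation matrix

print(median_age, age_std, age_range)
print(population_skew, age_kurt)
print(correlations)
```

Off-diagonal values close to 1 or -1 in the correlation matrix are one simple signal that two columns may be collinear.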


Data used for analysis in real-world scenarios is often incomplete because of omissions, faulty devices, and many other factors.

Pandas represents missing values as NA or NaN. They can be detected with isnull() and notnull(), filled with fillna() or replace(), and removed with dropna().
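
These functions can be tried on a small Series. In the sketch below the gaps are introduced on purpose for illustration, and filling with the mean is just one possible choice.

```python
import numpy as np
import pandas as pd

# Ages with two values deliberately left missing for the example
ages = pd.Series([20, np.nan, 67, np.nan],
                 index=['Ghana', 'Kenya', 'Nigeria', 'Togo'])

missing_mask = ages.isnull()        # True where a value is missing
present_mask = ages.notnull()       # True where a value is present
filled = ages.fillna(ages.mean())   # fill gaps with the mean of the known values
dropped = ages.dropna()             # remove entries with missing values
replaced = ages.replace(np.nan, 0)  # replace NaN with a fixed value

print(missing_mask.sum(), "missing values")
print(filled.tolist())
print(dropped.index.tolist())
```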