# Data Manipulation with pandas

## DataFrame
A DataFrame is a rectangular (tabular) representation of data. It consists of _rows_ and _columns_. Each column has a single data type, for example *text*, *float* or *integer*, so all values within a column share that type, while different columns can have different types.
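
As a small illustration of the one-type-per-column rule, the sketch below builds a tiny DataFrame by hand and prints the type of each column. The variable name `sample` is made up for this example; the values are three rows of the homelessness dataset used in this notebook.

```python
import pandas as pd

# Hypothetical three-row sample; values copied from the homelessness dataset.
sample = pd.DataFrame(
    {
        "state": ["Alabama", "Alaska", "Arizona"],   # text
        "individuals": [2570.0, 1434.0, 7259.0],     # float
        "state_pop": [4887681, 735139, 7158024],     # integer
    }
)

# Each column reports exactly one dtype.
print(sample.dtypes)
```

Different columns can hold different types, but within a column the type is uniform.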

This notebook works with an example dataset on homelessness in US states.

We can load data directly from CSV files with _pandas_.

In [5]:
import pandas as pd

homelessness = pd.read_csv("./homelessness.csv")
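
The outputs in this notebook contain a column called `Unnamed: 0`. A column with that name usually appears when a CSV was saved together with its row index and is then read back without telling pandas about it. Assuming that is what happened here, the sketch below shows how `index_col=0` would turn that column into the index. It reads from an in-memory string (`csv_text`, a made-up two-row excerpt) so that it runs without the actual file.

```python
import io

import pandas as pd

# Hypothetical two-row excerpt in the same layout as homelessness.csv.
csv_text = """,region,state,individuals,family_members,state_pop
0,East South Central,Alabama,2570.0,864.0,4887681
1,Pacific,Alaska,1434.0,582.0,735139
"""

# Default behaviour: the unnamed first column becomes "Unnamed: 0".
default = pd.read_csv(io.StringIO(csv_text))
print(default.columns.tolist())

# With index_col=0 the first column is used as the row index instead.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(indexed.columns.tolist())
```

The rest of this notebook keeps the default behaviour, so `Unnamed: 0` remains visible in the outputs.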

## Inspecting DataFrames

#### Basic structure
We can inspect the basic structure of a DataFrame. A neat way to do this is its `.head()` method, which returns only the first few rows (five by default) instead of the whole table.

In [7]:
print(homelessness.head())

   Unnamed: 0              region       state  individuals  family_members  \
0           0  East South Central     Alabama       2570.0           864.0   
1           1             Pacific      Alaska       1434.0           582.0   
2           2            Mountain     Arizona       7259.0          2606.0   
3           3  West South Central    Arkansas       2280.0           432.0   
4           4             Pacific  California     109008.0         20964.0   

   state_pop  
0    4887681  
1     735139  
2    7158024  
3    3009733  
4   39461588  
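
`.head()` also accepts the number of rows to return, and `.tail()` does the same from the end of the table. A minimal sketch, using a hand-built five-row DataFrame (`sample`, a name made up for this example) with values taken from the output above:

```python
import pandas as pd

# Hypothetical sample; values copied from the head() output above.
sample = pd.DataFrame(
    {
        "state": ["Alabama", "Alaska", "Arizona", "Arkansas", "California"],
        "individuals": [2570.0, 1434.0, 7259.0, 2280.0, 109008.0],
    }
)

first_two = sample.head(2)  # first two rows
last_two = sample.tail(2)   # last two rows

print(first_two)
print(last_two)
```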


#### Retrieve Details with `.info()`
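We might also need additional information to understand our DataFrame, such as the data types of its columns and whether they contain missing values. The `.info()` method provides this. It prints its summary directly and returns `None`, which is why wrapping it in `print()` adds a trailing `None` to the output. From the output below we can see, for example, that ...
1) our DataFrame has 51 rows.
2) the columns have different data types (dtypes), and none of them contains missing values (51 non-null each).
3) our DataFrame uses at least about 2.5 KB of memory (the `+` marks a lower bound).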



In [8]:
print(homelessness.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51 entries, 0 to 50
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Unnamed: 0      51 non-null     int64  
 1   region          51 non-null     object 
 2   state           51 non-null     object 
 3   individuals     51 non-null     float64
 4   family_members  51 non-null     float64
 5   state_pop       51 non-null     int64  
dtypes: float64(2), int64(2), object(2)
memory usage: 2.5+ KB
None
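
If only one of these details is needed, it can also be queried on its own. The sketch below reads the dtypes via `.dtypes` and counts missing values per column with `.isna().sum()`. The DataFrame `sample` is made up for this example, and one value is deliberately left out to show a non-zero count; the real dataset has no missing values according to the output above.

```python
import pandas as pd

# Hypothetical sample with one artificially missing value (None).
sample = pd.DataFrame(
    {
        "state": ["Alabama", "Alaska", "Arizona"],
        "family_members": [864.0, None, 2606.0],
    }
)

# Data type of each column.
print(sample.dtypes)

# Number of missing values in each column.
missing_per_column = sample.isna().sum()
print(missing_per_column)
```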


#### Count of rows & columns with `.shape`

The `.shape` attribute holds the number of rows and columns of our DataFrame. It is an attribute rather than a method, so it is accessed without parentheses.
The following output is to be read as _(number of rows, number of columns)_.

In [10]:
print(homelessness.shape)

(51, 6)
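
Because `.shape` is a plain tuple, it can be unpacked into two variables. A minimal sketch with a made-up three-row DataFrame (`sample`, `n_rows` and `n_cols` are names chosen for this example):

```python
import pandas as pd

# Hypothetical sample; values copied from the homelessness dataset.
sample = pd.DataFrame(
    {
        "state": ["Alabama", "Alaska", "Arizona"],
        "state_pop": [4887681, 735139, 7158024],
    }
)

# Unpack the (rows, columns) tuple.
n_rows, n_cols = sample.shape
print(n_rows, n_cols)

# len() on a DataFrame returns the number of rows as well.
print(len(sample))
```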


#### Numeric description with `.describe()`
pandas can give us a helpful statistical overview of the numeric columns of our DataFrame. By default, `.describe()` returns the count, mean, standard deviation, minimum, quartiles (25%, 50%, 75%) and maximum of each numeric column.

In [11]:
print(homelessness.describe())

       Unnamed: 0    individuals  family_members     state_pop
count   51.000000      51.000000       51.000000  5.100000e+01
mean    25.000000    7225.784314     3504.882353  6.405637e+06
std     14.866069   15991.025083     7805.411811  7.327258e+06
min      0.000000     434.000000       75.000000  5.776010e+05
25%     12.500000    1446.500000      592.000000  1.777414e+06
50%     25.000000    3082.000000     1482.000000  4.461153e+06
75%     37.500000    6781.500000     3196.000000  7.340946e+06
max     50.000000  109008.000000    52070.000000  3.946159e+07


In the example above, the `individuals` column tells us that...
1) (_min_) the lowest count of homeless individuals among the 51 states in the dataset is 434.
2) (_max_) the highest count of homeless individuals in a state is 109,008.
3) (_mean_) the average count of homeless individuals per state is about 7,226.
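
`.describe()` reports the extreme values but not the rows they belong to. One way to look those up is sketched below: single statistics are computed with `.agg()`, and `.idxmax()` returns the index label of the largest value. The five-row DataFrame `sample` is made up for this example from the first rows of the dataset, so its minimum and mean differ from the full-table figures above.

```python
import pandas as pd

# Hypothetical sample: the first five rows of the dataset, indexed by state.
sample = pd.DataFrame(
    {
        "state": ["Alabama", "Alaska", "Arizona", "Arkansas", "California"],
        "individuals": [2570.0, 1434.0, 7259.0, 2280.0, 109008.0],
    }
).set_index("state")

# Selected statistics for one column.
stats = sample["individuals"].agg(["min", "max", "mean"])
print(stats)

# Index label (here: the state) of the row with the largest value.
top_state = sample["individuals"].idxmax()
print(top_state)
```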