### Requirements

In [2]:
import pandas as pd

# for plotting
import matplotlib.pyplot as plt



# 9. Pandas: Basic Functions and Operations

### Importing and Exporting data
Pandas makes it easy to import and export data with various formats. To name a few:

| Format | Reader | Writer |
|----|----|----|
| MS Excel | read_excel | to_excel |
| CSV | read_csv | to_csv |
| HDF5 | read_hdf | to_hdf |
| HTML | read_html | to_html |
| Stata | read_stata | to_stata |
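
The reader and writer columns of the table pair up, so data written with a writer can be read back with the matching reader. A minimal sketch of such a round trip for CSV, assuming a tiny made-up DataFrame and an in-memory buffer instead of a real file:

```python
import io

import pandas as pd

# Hypothetical miniature data set, only for illustration.
df = pd.DataFrame({"gender": ["woman", "man"], "height": [160.0, 195.0]})

# Write to an in-memory text buffer instead of a file on disk.
buffer = io.StringIO()
df.to_csv(buffer, index=False)  # index=False: do not write the row index

# Rewind the buffer and read the CSV text back into a new DataFrame.
buffer.seek(0)
df_roundtrip = pd.read_csv(buffer)

print(df_roundtrip)
```

Passing a filename such as "my_data.csv" (a hypothetical name) instead of the buffer would write to and read from disk in the same way.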



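When you type *pd.read_* and press Tab in Jupyter or IPython, the completion list shows which data formats are supported. Note that *pd.read_* is only a prefix for completion: executing it as code raises an AttributeError, as the next cell shows.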

In [32]:
pd.read_

AttributeError: module 'pandas' has no attribute 'read_'

For example, with the function

                                        pd.read_csv(csv_file)
                                        
you can load data directly into a pandas DataFrame. For *csv_file* you can specify either a local filename (a file saved on your computer) or a URL.

CSV stands for comma-separated values. It is just a text file, and you can open it with any text editor. Normally the first row contains the comma-separated column names, and every subsequent row is one data sample whose individual features are separated by commas.

Sometimes the separator is not a comma but another symbol, such as a semicolon. In that case you can use the *delimiter=* argument. When you want to skip the first rows of the file because they contain a description, you can use the *skiprows=* argument. There are many other arguments as well.
When you have problems importing data from a CSV file, take a look at the documentation:

https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
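
As a minimal sketch of the two arguments mentioned above, the following example reads semicolon-separated text that starts with two description lines. The file content is made up for illustration and is kept in memory with `io.StringIO` rather than stored on disk:

```python
import io

import pandas as pd

# Hypothetical file content: two description lines, then semicolon-separated data.
raw_text = (
    "Survey export\n"
    "Created for illustration only\n"
    "gender;height;shoe_size\n"
    "woman;160.0;40.0\n"
    "man;195.0;46.0\n"
)

# skiprows=2 skips the two description lines; delimiter=";" sets the separator.
df = pd.read_csv(io.StringIO(raw_text), delimiter=";", skiprows=2)

print(df)
```

Without `skiprows=2` the first description line would be taken as the header, and without `delimiter=";"` each data row would end up in a single column.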



In [3]:
# We will load some data from the file wo_men.csv

shoesizes_height = pd.read_csv("wo_men.csv")

shoesizes_height 

Unnamed: 0,time,gender,height,shoe_size
0,04.10.2016 17:58:51,woman,160.0,40.0
1,04.10.2016 17:58:59,woman,171.0,39.0
2,04.10.2016 18:00:15,woman,174.0,39.0
3,04.10.2016 18:01:17,woman,176.0,40.0
4,04.10.2016 18:01:22,man,195.0,46.0
...,...,...,...,...
96,17.10.2016 12:37:09,woman,170.0,39.0
97,17.10.2016 13:12:48,woman,183.0,39.0
98,19.10.2016 17:07:53,woman,173.0,40.0
99,29.10.2016 20:28:33,woman,160.0,37.0


In [4]:
# print out just the first 5 rows to get an idea
# especially useful when you have huge data sets
shoesizes_height.head() 

Unnamed: 0,time,gender,height,shoe_size
0,04.10.2016 17:58:51,woman,160.0,40.0
1,04.10.2016 17:58:59,woman,171.0,39.0
2,04.10.2016 18:00:15,woman,174.0,39.0
3,04.10.2016 18:01:17,woman,176.0,40.0
4,04.10.2016 18:01:22,man,195.0,46.0


### Converting to Numpy

With the method

                        .to_numpy()
                        
you can convert a DataFrame or a Series to a NumPy array.
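
The data type of the resulting array depends on the columns you select. A small sketch, assuming a made-up two-row DataFrame: a single numeric column gives a float array, while a mix of text and numbers gives a generic object array.

```python
import pandas as pd

# Hypothetical miniature data set, only for illustration.
df = pd.DataFrame({"gender": ["woman", "man"], "height": [160.0, 195.0]})

# A single numeric column converts to a float array.
heights = df["height"].to_numpy()
print(heights.dtype)

# Mixing text and numbers forces a generic object array.
mixed = df[["gender", "height"]].to_numpy()
print(mixed.dtype, mixed.shape)
```

Object arrays are slower than numeric ones for arithmetic, so it is usually worth selecting only the numeric columns before converting.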

In the following cell we want to create a numpy matrix from the two columns gender and height.

In [5]:
shoesizes_height[['gender', 'height']].to_numpy()

array([['woman', 160.0],
       ['woman', 171.0],
       ['woman', 174.0],
       ['woman', 176.0],
       ['man', 195.0],
       ['woman', 157.0],
       ['woman', 160.0],
       ['woman', 178.0],
       ['woman', 168.0],
       ['man', 171.0],
       ['woman', 165.0],
       ['man', 175.0],
       ['woman', 163.0],
       ['woman', 158.0],
       ['woman', 159.0],
       ['man', 183.0],
       ['woman', 155.0],
       ['woman', 172.0],
       ['woman', 164.0],
       ['woman', 158.0],
       ['woman', 174.0],
       ['woman', 164.0],
       ['woman', 168.0],
       ['woman', 168.0],
       ['woman', 163.0],
       ['woman', 160.0],
       ['man', 183.0],
       ['woman', 161.0],
       ['woman', 162.0],
       ['woman', 165.0],
       ['woman', 164.0],
       ['woman', 161.0],
       ['woman', 163.0],
       ['woman', 169.0],
       ['woman', 171.0],
       ['woman', 163.0],
       ['woman', 159.0],
       ['woman', 180.0],
       ['woman', 168.0],
       ['woman', 170.0],
       ['w

### Delete data

Because we do not need the time column for our analysis, we delete it with the pandas method:

                                        data_frame.drop(c_labels, axis=1)
                                        
The axis is 1 because we want to delete columns. To delete rows, the axis would be 0.

For *c_labels* we need the exact name (or a list of the exact names) of the columns that we want to delete.
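
A short sketch of both axes, assuming a made-up three-row DataFrame. It also uses the `columns=` keyword, which is an equivalent way to name the columns to delete:

```python
import pandas as pd

# Hypothetical miniature data set, only for illustration.
df = pd.DataFrame(
    {
        "time": ["t0", "t1", "t2"],
        "gender": ["woman", "man", "woman"],
        "height": [160.0, 195.0, 171.0],
    }
)

# Delete a column: axis=1, or equivalently the columns= keyword.
without_time = df.drop(["time"], axis=1)
same_result = df.drop(columns=["time"])

# Delete a row by its index label: axis=0 (the default).
without_first_row = df.drop([0], axis=0)

# drop returns a new DataFrame; df itself still has all columns.
print(list(df.columns))
print(list(without_time.columns))
```

Because `drop` returns a new DataFrame, the result has to be assigned to a variable if you want to keep it.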

First we check which columns are available

In [6]:
shoesizes_height.columns

Index(['time', 'gender', 'height', 'shoe_size'], dtype='object')

Now we delete 'time'

In [7]:
shoesizes_height = shoesizes_height.drop(['time'], axis=1)

In [8]:
shoesizes_height

Unnamed: 0,gender,height,shoe_size
0,woman,160.0,40.0
1,woman,171.0,39.0
2,woman,174.0,39.0
3,woman,176.0,40.0
4,man,195.0,46.0
...,...,...,...
96,woman,170.0,39.0
97,woman,183.0,39.0
98,woman,173.0,40.0
99,woman,160.0,37.0


### Descriptive Statistics

1. The **sum** of all shoe sizes

In [9]:
shoesizes_height['shoe_size'].sum()

3977.5

2. The **mean** shoe size

In [11]:
shoesizes_height['shoe_size'].mean()

39.775

3. Sample **standard deviation** of shoe size

In [13]:
shoesizes_height['shoe_size'].std()

5.556130020804123

4. **Summary statistics**

In [15]:
# for the column shoe_size
shoesizes_height['shoe_size'].describe()

count    100.00000
mean      39.77500
std        5.55613
min       35.00000
25%       38.00000
50%       39.00000
75%       40.00000
max       88.00000
Name: shoe_size, dtype: float64

In [16]:
# for all columns
shoesizes_height.describe()

Unnamed: 0,height,shoe_size
count,100.0,100.0
mean,165.2338,39.775
std,39.817544,5.55613
min,1.63,35.0
25%,163.0,38.0
50%,168.5,39.0
75%,174.25,40.0
max,364.0,88.0
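
The `std` values above are sample standard deviations: pandas divides by n - 1 by default (`ddof=1`), whereas NumPy's `np.std` divides by n by default (`ddof=0`). A minimal sketch of the difference, assuming a made-up Series of four shoe sizes:

```python
import numpy as np
import pandas as pd

# Hypothetical shoe sizes, only for illustration.
sizes = pd.Series([38.0, 39.0, 40.0, 46.0])

sample_std = sizes.std()               # pandas default: ddof=1 (divide by n - 1)
population_std = sizes.std(ddof=0)     # divide by n instead
numpy_std = np.std(sizes.to_numpy())   # NumPy default: ddof=0

# describe() returns a Series, so single statistics can be picked by label.
summary = sizes.describe()

print(sample_std, population_std, numpy_std)
print(summary["std"], summary["50%"])
```

The `std` entry of `describe()` matches `sizes.std()`, and the `50%` entry is the median.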


In [17]:
# Get data types per column
shoesizes_height.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 101 entries, 0 to 100
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   gender     100 non-null    object 
 1   height     100 non-null    float64
 2   shoe_size  100 non-null    float64
dtypes: float64(2), object(1)
memory usage: 2.5+ KB


### Function application

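These were all methods from the pandas library. If you want to apply your own function or a function from another library to a pandas DataFrame, use the
      
                                            apply()
                                    
method. On a DataFrame it applies the function row-wise or column-wise; on a Series (a single column) it applies the function to each element.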

Example: Our data set contains the height and shoe size of women and men.

For further processing we want to convert each height to the string "small" or "tall".

In [24]:
def convert_height(height):
    if height < 170:
        return "small"
    return "tall"

shoesizes_height['height2'] = shoesizes_height['height'].apply(convert_height)

In [25]:
shoesizes_height

Unnamed: 0,gender,height,shoe_size,height2
0,woman,160.0,40.0,small
1,woman,171.0,39.0,tall
2,woman,174.0,39.0,tall
3,woman,176.0,40.0,tall
4,man,195.0,46.0,tall
...,...,...,...,...
96,woman,170.0,39.0,tall
97,woman,183.0,39.0,tall
98,woman,173.0,40.0,tall
99,woman,160.0,37.0,small
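
The cell above applies a function to each element of one column. On a whole DataFrame, `apply` passes entire columns (default, `axis=0`) or entire rows (`axis=1`) to the function. A minimal sketch, assuming a made-up two-row DataFrame and a hypothetical helper `height_per_shoe_size`:

```python
import pandas as pd

# Hypothetical miniature data set, only for illustration.
df = pd.DataFrame({"height": [160.0, 195.0], "shoe_size": [40.0, 46.0]})

# Column-wise (axis=0, the default): the function receives one column at a time.
column_ranges = df.apply(lambda col: col.max() - col.min())


def height_per_shoe_size(row):
    # Hypothetical helper: combines two values from the same row.
    return row["height"] / row["shoe_size"]


# Row-wise (axis=1): the function receives one row at a time.
df["ratio"] = df.apply(height_per_shoe_size, axis=1)

print(column_ranges)
print(df)
```

For simple arithmetic like this ratio, the direct expression `df["height"] / df["shoe_size"]` gives the same values and is usually faster; `apply` is most useful when the logic cannot be written as column operations.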


This is a general way to convert values. For a simple one-to-one substitution you can instead use the built-in pandas method

                                            .replace('zero', 0)
                                            
In this example all string values *zero* are replaced with the number 0.
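
A minimal sketch of this replacement, assuming a made-up Series in which some numbers were entered as the word "zero":

```python
import pandas as pd

# Hypothetical answers, only for illustration: a mix of numbers and the word "zero".
answers = pd.Series(["zero", 3, "zero", 7])

# Replace every exact occurrence of the string "zero" with the number 0.
cleaned = answers.replace("zero", 0)

print(cleaned.tolist())
```

Only values that are exactly equal to "zero" are replaced; the other entries are left as they are.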
                                            