## Data analytics with Python
### Preparation for working with dataframes

Processing tables with the pandas package in Python requires fundamental knowledge of:
1. Combining functions, conditions and loops together
2. Dictionaries
3. Modules
4. Arrays and vectorization
5. File directory
6. Intro to pandas

### 1. Combining functions, conditions and loops together
Suppose we want to reformat the phone numbers in the list below.

In [5]:
phone_nums = ['66885555555', 
              '660885555555', 
              '0827774444', 
              '0826662222', 
              '660885557777', 
              '6608855577779', 
              '9988776',
              '66892345678',
              '66892345679',
              '66892345579']

We need to turn the list above into this:

In [6]:
phone_num_cleaned = ['+66885555555', 
                     '+66885555555', 
                     '+66827774444', 
                     '+66826662222', 
                     '+66885557777', 
                     '', 
                     '',
                     '+66892345678',
                     '+66892345679',
                     '+66892345579']

Our goal is to automate this job, so we will eventually write a function. But let's walk through each step first. <br>
First, let's look for patterns in the phone numbers.

In [8]:
for num in phone_nums:
    print(len(num))
    print(num)
    print()

11
66885555555

12
660885555555

10
0827774444

10
0826662222

12
660885557777

13
6608855577779

7
9988776

11
66892345678

11
66892345679

11
66892345579



So we will use the length of each phone number as the screening criterion for cleaning:
1. if length = 11: add '+' in front of the first digit
2. if length = 10: replace the first digit ('0') with '+66'
3. if length = 12: replace the leading '660' with '+66'
4. else (length = 7 or 13 in this list): replace the number with an empty string

In [9]:
# case 1
num_11 = '66885555555'
num_11_c = '+' + num_11
num_11_c

'+66885555555'

In [10]:
# case 2
num_10 = '0826662222'
num_10_c = '+66' + num_10[1:]
num_10_c

'+66826662222'

In [12]:
# case 3 
num_12 = '660885557777'
num_12_c = '+66' + num_12[3:]
num_12_c

'+66885557777'

Next, let's combine these conditional statements in a for loop.

In [14]:
for num in phone_nums:
    if len(num) == 11:
        print('+' + num)
    elif len(num) == 10:
        print('+66' + num[1:])
    elif len(num) == 12:
        print('+66' + num[3:])
    else:
        print('')

+66885555555
+66885555555
+66827774444
+66826662222
+66885557777


+66892345678
+66892345679
+66892345579


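Before writing the function, recall that append() adds an item to the end of a list.

In [8]:
nums = [1, 2, 3]  # a small example list
nums.append('H')  # add 'H' to the end of the list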

In [9]:
nums

[1, 2, 3, 'H']

Let's make this repeatable by creating a function.

In [16]:
# define function
def clean_phone_number(nums):
    cleaned_nums = [] # Create an empty list to store cleaned numbers
    for num in nums: # Iteration on phone numbers list
        if len(num) == 11: # Conditions and jobs
            nnum = '+' + num
            cleaned_nums.append(nnum)
        elif len(num) == 10:
            nnum = '+66' + num[1:]
            cleaned_nums.append(nnum)
        elif len(num) == 12:
            nnum = '+66' + num[3:]
            cleaned_nums.append(nnum)
        else:
            cleaned_nums.append('')
    return cleaned_nums # Return cleaned data

In [23]:
# call 
cleaned = clean_phone_number(phone_nums)
cleaned

['+66885555555',
 '+66885555555',
 '+66827774444',
 '+66826662222',
 '+66885557777',
 '',
 '',
 '+66892345678',
 '+66892345679',
 '+66892345579']

In [24]:
phone_num_cleaned

['+66885555555',
 '+66885555555',
 '+66827774444',
 '+66826662222',
 '+66885557777',
 '',
 '',
 '+66892345678',
 '+66892345679',
 '+66892345579']

In [21]:
# check if both lists are equal
cleaned == phone_num_cleaned

True
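
As a possible variation (a sketch, not the approach used in this notebook), the rule for a single number can be pulled into a small helper and applied with a list comprehension. The names `clean_one` and `sample` are made up for this illustration.

```python
def clean_one(num):
    """Clean one phone number string using the length rules from this section."""
    if len(num) == 11:          # e.g. '66885555555' -> add '+'
        return '+' + num
    if len(num) == 10:          # e.g. '0827774444' -> drop the leading '0'
        return '+66' + num[1:]
    if len(num) == 12:          # e.g. '660885557777' -> drop the leading '660'
        return '+66' + num[3:]
    return ''                   # any other length is treated as invalid

# a short sample taken from the phone_nums list above
sample = ['66885555555', '0827774444', '660885557777', '9988776']
cleaned_sample = [clean_one(n) for n in sample]
print(cleaned_sample)
# ['+66885555555', '+66827774444', '+66885557777', '']
```

Keeping the rule for one number separate from the loop makes it easier to test each case on its own.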

### 2. Dictionaries
A dictionary is a very useful data structure that maps keys to values. Many languages offer a similar structure; in Python it is built in. A typical dictionary contains keys that we can look up to get their values.

In [10]:
car = {'Brand': 'Toyota',
       'Model': 'Camry',
       'Engine size': 1.8,
       'Year': 2011}

In [27]:
# get items
car.items()

dict_items([('Brand', 'Toyota'), ('Model', 'Camry'), ('Engine size', 1.8), ('Year', 2011)])

In [28]:
# get keys
car.keys()

dict_keys(['Brand', 'Model', 'Engine size', 'Year'])

In [29]:
# get values
car.values()

dict_values(['Toyota', 'Camry', 1.8, 2011])

In [31]:
# get the value based on the selected key
print(car['Brand'])
print(car['Year'])

Toyota
2011


In [11]:
# iterate through each item
for item in car.items():
    print(item)

('Brand', 'Toyota')
('Model', 'Camry')
('Engine size', 1.8)
('Year', 2011)
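
A few more dictionary operations tend to come up often. The sketch below reuses the `car` dictionary from above; the 'Color' key and its values are invented for the example.

```python
car = {'Brand': 'Toyota', 'Model': 'Camry', 'Engine size': 1.8, 'Year': 2011}

# unpack each (key, value) pair directly in the loop
for key, value in car.items():
    print(key, '->', value)

# get() returns a default value instead of raising KeyError for a missing key
color = car.get('Color', 'unknown')
print(color)  # unknown

# assigning to a key updates an existing entry or adds a new one
car['Year'] = 2012
car['Color'] = 'White'
print(car)
```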


Common use cases in data processing:
1. Renaming columns. In this case, the keys are the old column names and the values are the new column names.

In [38]:
columns = {'Col1': 'col1', 'Col3': 'col3', 'Date': 'date', 'id': 'ID'}
columns

{'Col1': 'col1', 'Col3': 'col3', 'Date': 'date', 'id': 'ID'}

In [39]:
# what is the new column name for date?
columns['Date']

'date'
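
To illustrate how such a mapping might be applied, here is a plain-Python sketch that renames a list of column names. The list `old_names` is hypothetical; names without an entry in the dictionary are kept unchanged. In pandas, `DataFrame.rename(columns=...)` accepts the same kind of dictionary.

```python
columns = {'Col1': 'col1', 'Col3': 'col3', 'Date': 'date', 'id': 'ID'}

# hypothetical column names of some table; 'Col2' has no entry in the mapping
old_names = ['Col1', 'Col2', 'Col3', 'Date', 'id']

# look up each name, falling back to the original name when it is not a key
new_names = [columns.get(name, name) for name in old_names]
print(new_names)
# ['col1', 'Col2', 'col3', 'date', 'ID']
```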

2. Creating a so-called DataFrame, the data table that we are going to learn how to process. <br>
- Create a dictionary whose keys are the column names and whose values are lists holding that column's value for each row.
- Convert the dictionary into a dataframe using the pandas module. (We will discuss modules in the next section.)

In [41]:
# create a dictionary
data = {'Month': ['Jan', 'Feb', 'Mar'],
        'Sales': [100, 222, 50]}
data

{'Month': ['Jan', 'Feb', 'Mar'], 'Sales': [100, 222, 50]}
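
To see why this dictionary already describes a table, the sketch below rebuilds the rows from the columns using only built-in functions. The variable names `column_names` and `rows` are chosen for the illustration.

```python
data = {'Month': ['Jan', 'Feb', 'Mar'],
        'Sales': [100, 222, 50]}

# the keys are the column names
column_names = list(data.keys())

# zip(*...) takes the i-th item of every column to form the i-th row
rows = list(zip(*data.values()))

print(column_names)  # ['Month', 'Sales']
for row in rows:
    print(row)
# ('Jan', 100)
# ('Feb', 222)
# ('Mar', 50)
```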

### 3. Modules
Modules are code libraries containing sets of ready-made functions. For example, if we want to round a float up to the nearest integer, we need to import the math module (part of Python's standard library) to complete the task.

In [12]:
# without module
ceil(5.555)

NameError: name 'ceil' is not defined

In [13]:
# with module
import math
math.ceil(5.555)

6
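
There are a few common ways to import from a module. The sketch below shows three of them with the math module; the alias `m` is an arbitrary choice for the example.

```python
import math                    # use as math.ceil(...)
import math as m               # use a shorter alias: m.floor(...)
from math import ceil, floor   # import selected functions by name

print(math.ceil(5.555))  # 6
print(m.floor(5.555))    # 5
print(ceil(5.001))       # 6
print(floor(-5.5))       # -6
```

The alias form is the one used for pandas (`import pandas as pd`) and numpy (`import numpy as np`) in this notebook.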

Likewise, pandas is a module for working with PANel DAta (tables).

In [14]:
# recall the dictionary from the previous section
data = {'Month': ['Jan', 'Feb', 'Mar'],
        'Sales': [100, 222, 50]}
print('data ', type(data))
display(data)

data  <class 'dict'>


{'Month': ['Jan', 'Feb', 'Mar'], 'Sales': [100, 222, 50]}

In [22]:
import pandas as pd

df = pd.DataFrame(data)
print('data frame ', type(df))
display(df)

data frame  <class 'pandas.core.frame.DataFrame'>


| | Month | Sales |
|---|---|---|
| 0 | Jan | 100 |
| 1 | Feb | 222 |
| 2 | Mar | 50 |


### 4. Arrays and vectorization
Vectorization means applying an operation to whole arrays at once instead of looping over individual elements. To use it, we first convert our data into arrays. Key advantages of doing this are:
- We write shorter code to complete tasks such as calculating values from two data structures.
- The program executes faster when we have a large dataset to process.

Let's consider the discount calculation task below.

In [23]:
# lists of prices and discount rate
prices = [100, 50, 25, 77, 66]
discounts = [0.2, 0.2, 0.3, 0.1, 0.05]

First, we will loop through both lists using zip().

FYI, zip() pairs up the items of two lists so that we can iterate over them together.

In [68]:
zipped = zip(prices, discounts)
list(zipped)

[(100, 0.2), (50, 0.2), (25, 0.3), (77, 0.1), (66, 0.05)]

Now we are ready to do the calculation

In [24]:
reduced_prices = []

for item in zip(prices, discounts):
    reduced_price = item[0] - (item[0]*item[1])
    reduced_prices.append(reduced_price)

display(reduced_prices)

[80.0, 40.0, 17.5, 69.3, 62.7]
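
The same loop can also be written with tuple unpacking, which avoids the `item[0]` and `item[1]` indexing. This is only a sketch of an alternative style; the rounding to two decimals is an added assumption to keep the printed values tidy.

```python
prices = [100, 50, 25, 77, 66]
discounts = [0.2, 0.2, 0.3, 0.1, 0.05]

# unpack each (price, discount) pair produced by zip()
reduced_prices = [round(price * (1 - discount), 2)
                  for price, discount in zip(prices, discounts)]
print(reduced_prices)
# [80.0, 40.0, 17.5, 69.3, 62.7]
```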

Alternatively, we can convert the lists into arrays and calculate the values using vectorization.

In [25]:
import numpy as np # Import module that helps us to work with numbers and arrays

# Convert lists into arrays
prices_arr = np.array(prices)
discounts_arr = np.array(discounts)

In [26]:
# Verify that they are arrays
display(prices_arr, discounts_arr)

array([100,  50,  25,  77,  66])

array([0.2 , 0.2 , 0.3 , 0.1 , 0.05])

Now we are ready to do the calculation (again). The code is much shorter.

In [27]:
reduced_prices = prices_arr - (prices_arr*discounts_arr)
display(reduced_prices)

array([80. , 40. , 17.5, 69.3, 62.7])

When we work with datasets in pandas, they are processed using vectorization as well.
<br>
<br>
NumPy arrays also let us do more in data analysis.

In [85]:
# scalar math operation with arrays
prices_arr*3

array([300, 150,  75, 231, 198])

In [86]:
prices_arr+3

array([103,  53,  28,  80,  69])
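
Comparisons are vectorized in the same way, which gives a compact way to filter values. The sketch below assumes a threshold of 60, picked only for illustration.

```python
import numpy as np

prices_arr = np.array([100, 50, 25, 77, 66])

# comparing an array with a scalar yields an array of booleans
mask = prices_arr > 60
print(mask.tolist())        # [True, False, False, True, True]

# indexing with the boolean array keeps only the items where it is True
expensive = prices_arr[mask]
print(expensive.tolist())   # [100, 77, 66]
```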

Another example is calculating the mean: we can't call mean() directly on a list, but we can on an array.

In [83]:
# calculate mean of the list
prices.mean()

AttributeError: 'list' object has no attribute 'mean'

In [82]:
# calculate mean of the array
prices_arr.mean()

63.6
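
For comparison, a plain list can still be averaged without numpy. The sketch below shows two options from the standard library, using the price list from the start of this section.

```python
import statistics

prices = [100, 50, 25, 77, 66]

# option 1: built-in sum() and len()
mean_builtin = sum(prices) / len(prices)

# option 2: the statistics module from the standard library
mean_stats = statistics.mean(prices)

print(mean_builtin)  # 63.6
print(mean_stats)    # 63.6
```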

Quick exercise: Total paid amount calculation. <br>
Assume that we walk through each order line and would like to calculate the amount that the customer paid for it. The total paid amount is calculated as: <br><br>
Total paid amount = (price - discount + shipping fee) + VAT (7% of that subtotal)  <br><br>

The expected outcome is <br>
array([139.1   ,  48.15  ,  16.05  , 887.993 ,  70.5665]) <br>

Note: All pricing information is available as arrays (see below)

In [30]:
prices =  np.array([100, 50, 25, 770, 66])
discounts = np.array([20, 5, 10, 0.1, 0.05])
shippings = np.array([50, 0, 0, 60, 0])

In [33]:
# your answer
a = (prices - discounts + shippings)*1.07
display(a)

array([139.1   ,  48.15  ,  16.05  , 887.993 ,  70.5665])

### 5. File directory
Working with datasets means we need to import/export files. To do that, we need to understand the file directory system first.

In [34]:
import os

# get the current working directory
print(os.getcwd())

C:\Users\natanop\Desktop\python_sat\week5_python\week5_python


In [35]:
# list all files in the directory
os.listdir()

['.ipynb_checkpoints',
 'datasets',
 'Week5_1-Data-analytics-python-df-prepare.ipynb',
 'Week5_2-Data-analytics-python-introPandas.ipynb']

Now, let's create a new folder called 'data' and list all files again.

In [36]:
os.listdir()

['.ipynb_checkpoints',
 'data',
 'datasets',
 'Week5_1-Data-analytics-python-df-prepare.ipynb',
 'Week5_2-Data-analytics-python-introPandas.ipynb']
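
The folder can also be created from Python instead of the file explorer. The sketch below uses `os.makedirs()`; it works inside a temporary scratch directory (an assumption made so that the example does not touch your real folders).

```python
import os
import tempfile

# work inside a throwaway directory that is deleted afterwards
with tempfile.TemporaryDirectory() as scratch:
    new_folder = os.path.join(scratch, 'data')

    # exist_ok=True means no error is raised if the folder already exists
    os.makedirs(new_folder, exist_ok=True)
    os.makedirs(new_folder, exist_ok=True)  # calling it again is harmless

    contents = os.listdir(scratch)
    print(contents)  # ['data']
```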

Now, let's change into that data directory and list its contents.

In [120]:
os.chdir('C:\\Users\\natanop.pimonsathian\\Desktop\\Programming\\dataupskilling-main\\dataupskilling-main\\data')
print(os.getcwd())

C:\Users\natanop.pimonsathian\Desktop\Programming\dataupskilling-main\dataupskilling-main\data


In [121]:
os.listdir()

['.ipynb_checkpoints', 'texts.txt']

Go back to the data folder and create a .txt file. Then list the directory again.

In [122]:
os.listdir()

['.ipynb_checkpoints', 'texts.txt']

In [42]:
# the with statement closes the file automatically, so f.close() is not needed
with open('C:/Users/natanop/Desktop/saturday/text2.txt') as f:
    lines = f.readlines()
    print(lines)

['hi, hi']


Again, let's go back up one level to the directory that contains this notebook.

In [125]:
os.chdir('C:\\Users\\natanop.pimonsathian\\Desktop\\Programming\\dataupskilling-main\\dataupskilling-main')
print(os.getcwd())
print(os.listdir())

C:\Users\natanop.pimonsathian\Desktop\Programming\dataupskilling-main\dataupskilling-main
['.ipynb_checkpoints', 'data', 'Exercise_answers.ipynb', 'python_functions_controlflows.ipynb', 'python_lists_sets.ipynb', 'python_strings.ipynb', 'python_variables_expressions.ipynb', 'README.md', 'sample.txt', 'SQL Snipped.sql', 'Week5_1-Data-analytics-python-df-prepare.ipynb']


Read the file again. You will see that we can't open it because there is no such file in this directory.

In [126]:
# a relative path is looked up starting from the current working directory
with open('texts.txt') as f:
    lines = f.readlines()

FileNotFoundError: [Errno 2] No such file or directory: 'texts.txt'

Try again using the correct relative path; now you will be able to open the file.

In [127]:
# the path now includes the data folder, relative to the current directory
with open('data/texts.txt') as f:
    lines = f.readlines()
    print(lines)

[]
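
One way to make file access less fragile is to build paths with `os.path.join()` and to check or handle the missing-file case explicitly. The sketch below recreates a small `data/texts.txt` layout inside a temporary directory; the file content is made up for the example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as scratch:
    # build the folder and file paths without hard-coding separators
    data_dir = os.path.join(scratch, 'data')
    os.makedirs(data_dir)
    file_path = os.path.join(data_dir, 'texts.txt')
    with open(file_path, 'w') as f:
        f.write('hi, hi\n')

    # the file is inside data/, not directly inside scratch
    wrong_path = os.path.join(scratch, 'texts.txt')
    found_wrong = os.path.exists(wrong_path)
    found_right = os.path.exists(file_path)

    # handle the error instead of letting the program stop
    try:
        with open(wrong_path) as f:
            lines = f.readlines()
    except FileNotFoundError:
        with open(file_path) as f:
            lines = f.readlines()

print(found_wrong, found_right)  # False True
print(lines)                     # ['hi, hi\n']
```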


### 6. Intro to pandas
We have mentioned that pandas is a package used for processing datasets. There are several ways to import a dataset into the Python environment; a convenient one is to use pd.read_csv() or pd.read_excel() to import csv or Excel files.

In [43]:
# read csv
data_csv = pd.read_csv('datasets/advertising.csv')
data_csv

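| | TV | Radio | Newspaper | Sales |
|---|---|---|---|---|
| 0 | 230.1 | 37.8 | 69.2 | 22.1 |
| 1 | 44.5 | 39.3 | 45.1 | 10.4 |
| 2 | 17.2 | 45.9 | 69.3 | 12.0 |
| 3 | 151.5 | 41.3 | 58.5 | 16.5 |
| 4 | 180.8 | 10.8 | 58.4 | 17.9 |
| ... | ... | ... | ... | ... |
| 195 | 38.2 | 3.7 | 13.8 | 7.6 |
| 196 | 94.2 | 4.9 | 8.1 | 14.0 |
| 197 | 177.0 | 9.3 | 6.4 | 14.8 |
| 198 | 283.6 | 42.0 | 66.2 | 25.5 |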
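
If the dataset file is not at hand, `pd.read_csv()` can be tried on text held in memory. The sketch below assumes a three-row excerpt with the same column names as the advertising data, wrapped in `io.StringIO` so that it behaves like a file.

```python
import io
import pandas as pd

# a tiny excerpt standing in for a csv file on disk
csv_text = """TV,Radio,Newspaper,Sales
230.1,37.8,69.2,22.1
44.5,39.3,45.1,10.4
17.2,45.9,69.3,12.0
"""

df_small = pd.read_csv(io.StringIO(csv_text))

print(df_small.shape)           # (3, 4): rows, columns
print(list(df_small.columns))   # ['TV', 'Radio', 'Newspaper', 'Sales']
print(df_small.head(2))         # first two rows
```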


In [130]:
# read excel - specify sheet name
data_excel = pd.read_excel('datasets/ads_excel.xlsx', sheet_name='data')
data_excel

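| | TV | Radio | Newspaper | Sales |
|---|---|---|---|---|
| 0 | 230.1 | 37.8 | 69.2 | 22.1 |
| 1 | 44.5 | 39.3 | 45.1 | 10.4 |
| 2 | 17.2 | 45.9 | 69.3 | 12.0 |
| 3 | 151.5 | 41.3 | 58.5 | 16.5 |
| 4 | 180.8 | 10.8 | 58.4 | 17.9 |
| ... | ... | ... | ... | ... |
| 195 | 38.2 | 3.7 | 13.8 | 7.6 |
| 196 | 94.2 | 4.9 | 8.1 | 14.0 |
| 197 | 177.0 | 9.3 | 6.4 | 14.8 |
| 198 | 283.6 | 42.0 | 66.2 | 25.5 |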


In [132]:
data_excel2 = pd.read_excel('datasets/ads_excel.xlsx', sheet_name='Sheet2')
data_excel2

| | Unnamed: 0 | Unnamed: 1 |
|---|---|---|
| 0 | | |
| 1 | Sum of TV | Sum of Sales |
| 2 | 29408.5 | 3026.1 |


We will move on to the next notebook to see what pandas can do.