### Functions

Parameters of the form *param contain a variable number of arguments within a tuple. Parameters of the form **param contain a variable number of keyword arguments.
    
    def connect(uname, *args, **kwargs):
        # connecting code here
        pass

This is known as packing.
Within the function, we can treat args as a tuple of the positional arguments provided and kwargs as a dictionary of the keyword arguments provided.

In [1]:
def connect(uname, *args, **kwargs):
    print(uname)
    for arg in args:
        print(arg)
    for key, value in kwargs.items():
        print(key, ":", value)

connect('admin', 'ilovecats', server='localhost', port=9160)

admin
ilovecats
server : localhost
port : 9160


Worth mentioning:
arg, *args and **kwargs

We can use *args to pass in a tuple as a single argument to our function. This tuple should contain the arguments in the order in which they are meant to be bound to the formal parameters.

> args = ('one', 2, 3)

> func(*args)

arg1: one
arg2: 2
arg3: 3

We would say that we're unpacking a tuple of arguments here.


We can use **kwargs to pass in a dictionary as a single argument to our function. This dictionary contains the formal parameters as keywords, associated with their argument values. Note that these can appear in any order.

> kwargs = {"arg3":3,"arg1":"one","arg2":2}
> func(**kwargs)

arg1:one arg2:2 arg3:3
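The notes above call `func` without ever defining it. A minimal sketch, assuming `func` is simply a three-parameter function (the name and body here are hypothetical), shows that both ways of unpacking bind the same values:

```python
# `func` is a hypothetical stand-in; the original notes never show its definition.
def func(arg1, arg2, arg3):
    return f"arg1: {arg1}, arg2: {arg2}, arg3: {arg3}"

args = ('one', 2, 3)
kwargs = {"arg3": 3, "arg1": "one", "arg2": 2}

from_tuple = func(*args)     # positional unpacking: order in the tuple matters
from_dict = func(**kwargs)   # keyword unpacking: order in the dict is irrelevant

print(from_tuple)
print(from_dict)
```

Both calls bind `arg1`, `arg2` and `arg3` to the same values, so the two printed lines are identical.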

One can also define lambda functions within Python.
• Use the keyword lambda instead of def.
• Can be used wherever function objects are used.
• Restricted to one expression.
• Typically used with functional programming tools; we will see this next time.

> def f(x):
      return x**2

> print(f(8) == 64)

> g = lambda x: x**2

> print(g(8) == 64)
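As a small illustration of "can be used wherever function objects are used", the sketch below passes lambdas to the built-ins `sorted()` and `map()`. The sample names are only illustrative.

```python
names = ['obiora', 'ada', 'ikenna']

# A lambda as the sort key: order the names by their length.
by_length = sorted(names, key=lambda name: len(name))

# A lambda passed to map(): square every number in the list.
squares = list(map(lambda x: x**2, [1, 2, 3]))

print(by_length)
print(squares)
```

Because `sorted()` is stable, names of equal length keep their original relative order.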

## Rationale

I am a firm believer in having access to all of the content I create in a simple text format. That is part of the reason why I use pelican for the blog and write all content in restructured text. I also believe in hosting the blog using static HTML so it is fast for readers and simple to distribute. Since I spend a lot of time creating content, I want to make sure I can easily transform it into another format if needed. Plain text files are the best format for my needs.

As I wrote in [my previous post](https://pbpython.com/five-years.html), Mailchimp was getting cost prohibitive. In addition, I did not like playing around with formatting emails. I want to focus on content and turning it into a clean and responsive email - not working with an online email editor. I also want the newsletter archives available for people to view and search in a more integrated way with the blog.

One thing that Mailchimp does well is that it provides an archive of emails and ability for the owner to download them in raw text. However, once you cancel your account, those archives will go away. It’s also not very search engine friendly so it’s hard to reference back to it and expose the content to others not subscribed to the newsletter.

With all that in mind, here is the high level process I had in mind:

Welcome to the 6th edition of this newsletter.

## Around the site

* [Combining Multiple Excel Worksheets Into a Single Pandas Dataframe](https://pbpython.com/pandas-excel-tabs.html)
covers a simple approach to parse multiple excel tabs into one DataFrame.

## Other news

* [Altair](https://altair-viz.github.io/index.html) just released a new version. If you haven't looked at it in a while,
check out some of the [examples](https://altair-viz.github.io/gallery/index.html) for a snapshot of what you can do with it.

## Final Words

Thanks again for subscribing to the newsletter. Feel free to forward it on to others that may be interested.

In [51]:
# Also see CS50 from Harvard (Python Programming)

In [50]:
# The following from: https://365datascience.com/python-functions/

In [2]:
def plus_ten(a):
    return a+10

In [3]:
plus_ten(5)

15

In [4]:
def plus_10(x):
    result = x + 10
    print('outcome: ')
    return result

In [5]:
plus_10(5)

outcome: 


15

In [6]:
def wage(w_hours):
    return w_hours*25
def with_bonus(w_hours):
    return wage(w_hours)+50

In [7]:
wage(5), with_bonus(5)

(125, 175)

In [8]:
'''
def wage(w_hours_day, week):
    return w_hours_day * week * 25
'''

'\ndef wage(w_hours_day, week):\n    return w_hours_day * week * 25\n'

In [26]:
# Johnny's mum pledges to give him an extra $10 if he saves at least $100 a week.
def add_ten(m):
    if m >= 100:
        return 'You now have', m + 10
    else:
        return "Johnny please save more"

In [10]:
add_ten(40)

'Johnny please save more'

In [11]:
add_ten(99)

'Johnny please save more'

In [19]:
add_ten(125)

('You now have', 135)

In [27]:
def add_10(m):
    if m >= 100:
        m = m + 10
        return m
    else:
        # m = m + 0 (which means nothing was added to Johnny)
        return 'Save more Johnny!'

In [28]:
add_10(127)

137

In [29]:
add_10(89)

'Save more Johnny!'

In [34]:
# let's add more than one parameter
# (named subtract_bc so that it does not shadow the built-in sum())
def subtract_bc(a, b, c):
    result = a-b*c
    print('a =', a)
    print('b =', b)
    print('c =', c)
    return result

In [35]:
subtract_bc(10, 3, 2)

a = 10
b = 3
c = 2


4

In [36]:
max(10, 20, 30)

30

In [37]:
min(10, 20, 30)

10

In [39]:
# returns absolute value of input
abs(-20)

20

In [41]:
# rounding off decimal values
round(3.556)

4

In [42]:
# rounding to 2 decimal places
round(3.556, 2)

3.56

In [46]:
# the two expressions perform the same operation
2**10
pow(2, 10)

1024

In [49]:
# how many characters (spaces included)
len('mathem atics')

12

In [5]:
# ndarray at hamoye
import numpy as np

In [2]:
a = [1, 2, 3, 4, 5]

In [3]:
type(a)

list

In [6]:
# convert to numpy array
data = np.array(a)

In [7]:
type(data)

numpy.ndarray

In [8]:
# what's the shape of my array
data.shape

(5,)

In [9]:
data.dtype # the data type

dtype('int32')

In [10]:
data.ndim # dimension of data

1

In [12]:
b = np.array([[2, 3, 4], [4, 5, 7]])

In [13]:
b

array([[2, 3, 4],
       [4, 5, 7]])

In [14]:
b.ndim

2

In [16]:
b.shape # 2D array with 2 rows and 3 columns

(2, 3)

In [20]:
# a 2x3 array with zeroes
np.zeros((2, 3))

array([[0., 0., 0.],
       [0., 0., 0.]])

In [19]:
# a 2x3 array of ones
np.ones((2,3))

array([[1., 1., 1.],
       [1., 1., 1.]])

In [26]:
# 2x3 with random values
np.random.randn(2,3)

array([[ 0.83542597,  2.12679019,  1.50491676],
       [-0.63573939, -0.36494818, -0.89776983]])

In [27]:
np.random.random((2,3))

array([[0.37253352, 0.31511647, 0.86257651],
       [0.12292388, 0.56286912, 0.64873169]])

In [28]:
# what's the difference btw random.random and random.randn ?
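To answer the question above: `np.random.random` draws from a uniform distribution on [0, 1), while `np.random.randn` draws from the standard normal distribution (mean 0, standard deviation 1), so it can return negative values and values above 1. A rough sketch that checks this empirically; the seed and sample size are arbitrary choices:

```python
import numpy as np

np.random.seed(0)  # fixed seed only so the sketch is repeatable

uniform = np.random.random(100_000)  # uniform samples on [0, 1)
normal = np.random.randn(100_000)    # standard normal samples

print(uniform.min() >= 0.0, uniform.max() < 1.0)  # every uniform sample stays inside [0, 1)
print(uniform.mean())                             # close to 0.5
print(normal.mean(), normal.std())                # close to 0 and 1
```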

In [42]:
# 2x4 array of random integers from 0 to 4
np.random.randint(5, size=(2,4))

array([[2, 2, 1, 1],
       [2, 3, 3, 0]])

In [43]:
# element-wise arithmetic between arrays
c = np.array([[9.0, 8.0, 7.0], [1.0, 2.0, 3.0]])
d = np.array([[4.0, 5.0, 6.0], [9.0, 8.0, 7.0]])

In [44]:
c+d

array([[13., 13., 13.],
       [10., 10., 10.]])

In [45]:
c*d

array([[36., 40., 42.],
       [ 9., 16., 21.]])

In [46]:
5/d

array([[1.25      , 1.        , 0.83333333],
       [0.55555556, 0.625     , 0.71428571]])

In [47]:
c**2

array([[81., 64., 49.],
       [ 1.,  4.,  9.]])

In [48]:
pow(c, 2)

array([[81., 64., 49.],
       [ 1.,  4.,  9.]])

In [53]:
# arrays are indexed just like lists
c[1]

array([1., 2., 3.])

In [52]:
# identify element on second row at third column
c[1, 2]

3.0

In [58]:
# arrays can also be retrieved by slicing rows and columns or a combination of the two
c[1, 0:2]
# in the second row (index 1), get the values at column indices 0 and 1 (the stop index 2 is excluded).

array([1., 2.])

In [60]:
# slicing
e = np.array([[10, 11, 12], [13, 14, 15], [16, 17, 18], [19, 20, 21]])

In [62]:
# slicing
# get me first three rows and two columns
e[:3, :2]

array([[10, 11],
       [13, 14],
       [16, 17]])

In [64]:
# integer indexing: pick out individual elements by pairing row indices with column indices
e[[2, 0, 3, 1], [2, 1, 0, 2]]
# this gives e[2, 2] (row 3, col 3), e[0, 1], e[3, 0] and e[1, 2]

array([18, 11, 19, 15])

In [65]:
# recall, in python, indices start at 0 NOT 1
# in an array [rows][columns], so [2][3] means row 3, col 4
# while indexing with [1, 3] means row 2, col 4

In [66]:
# boolean indexing
e[e>17]

array([18, 19, 20, 21])

NumPy also has built-in mathematical functions such as sum(), mean(), std(), corrcoef(), min() and others. It also supports element-wise comparison of arrays: == checks which elements of two arrays are equal, while > and < check whether elements of the first array are greater than or less than the corresponding elements of the second.
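A short sketch of a few of these aggregate functions, reusing the values of the array `c` defined earlier. The `axis` argument chooses whether to reduce down the columns (`axis=0`) or across the rows (`axis=1`).

```python
import numpy as np

c = np.array([[9.0, 8.0, 7.0], [1.0, 2.0, 3.0]])

total = c.sum()              # sum of all six elements
col_means = c.mean(axis=0)   # mean of each column
row_max = c.max(axis=1)      # largest value in each row

print(total)
print(col_means)
print(row_max)
```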

##### File input and output with arrays
NumPy arrays can be saved to and loaded from binary files with the .npy extension using save() and load() respectively. The same can be done with text files using savetxt() and loadtxt().
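A minimal round-trip sketch of both pairs of functions. The file names and the temporary directory are arbitrary choices for illustration.

```python
import os
import tempfile

import numpy as np

data = np.array([[1.5, 2.5], [3.5, 4.5]])

with tempfile.TemporaryDirectory() as tmp:
    npy_path = os.path.join(tmp, "data.npy")
    txt_path = os.path.join(tmp, "data.txt")

    # Binary round trip: save() writes a .npy file, load() reads it back.
    np.save(npy_path, data)
    from_npy = np.load(npy_path)

    # Text round trip: savetxt() writes delimited text, loadtxt() parses it.
    np.savetxt(txt_path, data, delimiter=",")
    from_txt = np.loadtxt(txt_path, delimiter=",")

print(from_npy)
print(from_txt)
```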

In [70]:
# comparing arrays. You can also do >, < 
b == c

array([[False, False, False],
       [False, False, False]])

In [72]:
b > c

array([[False, False, False],
       [ True,  True,  True]])

# Pandas is here

In [3]:
import pandas as pd

###### Series, DataFrame and index are the basic data structures in this library

In [77]:
e = pd.array(['men', 'boys'])

In [78]:
e

<StringArray>
['men', 'boys']
Length: 2, dtype: string

In [4]:
# A Series in pandas is a one-dimensional labelled array. It is somewhat similar to a
# NumPy array; however, it can be indexed with descriptive labels as well as
# integer positions.

e = pd.Series(['Monday', 'Tuesday'])

In [81]:
e

0     Monday
1    Tuesday
dtype: object

In [2]:
# Series can be accessed using the specified index as shown below

In [5]:
e[1]

'Tuesday'

In [6]:
# note: the key 'd' appears twice in this dictionary, so 'obiora' silently overwrites 'ebuka'
new = pd.Series({'a': 'obi', 'b': 'ada', 'c': 'ikenna', 'd': 'ebuka', 'd': 'obiora' })

In [7]:
new[2]

'ikenna'

In [8]:
new[1:]

b       ada
c    ikenna
d    obiora
dtype: object

When you convert a dictionary to pandas series, the key automatically becomes your index.
With lists, you'd have to create an index for yourself.

In [13]:
new1 = pd.Series(['obi', 'ada', 'ikenna', 'ona', 'obiorah'],
                index=['a', 'b', 'c', 'd', 'e'])

In [14]:
new1[1]

'ada'

In [15]:
new1[3:]

d        ona
e    obiorah
dtype: object

In [16]:
new['d']

'obiora'

A DataFrame can be described as a table (2 dimensions) made up of many series that share the same index. It holds data in rows and columns just like a spreadsheet. Series, dictionaries, lists, other dataframes and NumPy arrays can all be used to create new ones.

In [26]:
df = [['obi', 'ada', 'ikenna', 'ona', 'obiorah'],
      [1, 7, 5, 6, 3], ['hoe', 'matchet', 'cutlass', 'rake', 'shovel'],
      ['ishaga', 'iju', 'oshodi', 'lekki', 'agege'], [187, 192, 143, 155, 167]]

In [29]:
df1 = pd.DataFrame(df, columns=['a', 'b', 'c', 'd', 'e'], 
                   index=['name', 'id', 'implement', 'address', 'height'])

In [30]:
df1

                a        b        c      d        e
name          obi      ada   ikenna    ona  obiorah
id              1        7        5      6        3
implement     hoe  matchet  cutlass   rake   shovel
address    ishaga      iju   oshodi  lekki    agege
height        187      192      143    155      167


~~Converting a list to pandas dataframe incurs you turning every other input such as columns and index to a list. This is so it can fit in with the existing architecture in place.~~

~~Let's see if that's true by converting a dictionary to a df~~

That was wrong. Everything goes in as a list item.

When a dictionary is converted to a dataframe, its keys automatically become the column labels; with a list of lists, each inner list becomes one row.
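A small sketch contrasting the two constructions. The column names and values are made up for illustration.

```python
import pandas as pd

# Dictionary: each key becomes a column label, each value list fills that column.
people = {
    'name': ['obi', 'ada', 'ikenna'],
    'height': [187, 192, 143],
}
from_dict = pd.DataFrame(people)

# List of lists: each inner list is one row, so the column labels must be supplied.
from_list = pd.DataFrame([['obi', 187], ['ada', 192], ['ikenna', 143]],
                         columns=['name', 'height'])

print(from_dict)
print(from_dict.equals(from_list))
```

Both constructions describe the same table; they differ only in whether the data is laid out column by column or row by row.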

In [35]:
new2 = pd.DataFrame([2, 3, 4, 5],
                   index=[1,2,3,4])

In [36]:
new2

   0
1  2
2  3
3  4
4  5


## at, iat, loc and iloc

**at**, **iat**, **iloc** and **loc** are accessors used to retrieve data from dataframes. **iloc** selects rows and columns by integer position, while **loc** selects rows or columns by label. **at** and **iat** retrieve single values: **at** uses the row and column labels and **iat** uses integer positions.

In [39]:
df1[2:]

                a        b        c      d       e
implement     hoe  matchet  cutlass   rake  shovel
address    ishaga      iju   oshodi  lekki   agege
height        187      192      143    155     167


Indexes in pandas are immutable, ordered arrays of labels (duplicates are allowed, although unique labels are usual) used for retrieving data in a dataframe and for aligning multiple dataframes.

The important Pandas functionalities: indexing, reindexing, selection, group, drop entities, ranking, sorting, duplicates and indexing by hierarchy.

In [40]:
# performing iloc and loc
df1

                a        b        c      d        e
name          obi      ada   ikenna    ona  obiorah
id              1        7        5      6        3
implement     hoe  matchet  cutlass   rake   shovel
address    ishaga      iju   oshodi  lekki    agege
height        187      192      143    155      167


In [58]:
# selecting the row at position 3. Remember iloc selects by integer position (0,1,2...), not by label
df1.iloc[3] # iloc takes only integer positions

a    ishaga
b       iju
c    oshodi
d     lekki
e     agege
Name: address, dtype: object

In [59]:
# loc selects based on index labels. My labels here are name, id, ...
df1.loc['address'] # loc takes whatever labels the index uses

a    ishaga
b       iju
c    oshodi
d     lekki
e     agege
Name: address, dtype: object

In [47]:
df1['a']
# from here, one can see that you can use df[] with a label to get a column's contents.
# to get a single row you need loc, iloc, at or iat (only slices such as df1[2:] select rows with df[]).

name            obi
id                1
implement       hoe
address      ishaga
height          187
Name: a, dtype: object

In [48]:
# selecting 'hoe' using at and iat

In [52]:
# iat works like iloc. It uses the default index to locate items (row 3, column 1). Python starts at 0
df1.iat[2, 0] 

'hoe'

In [53]:
# at works like loc. It uses the index and column labels to locate contents (row:implement, col:a)
df1.at['implement', 'a']

'hoe'

In [54]:
df1.iloc[2, 0]

'hoe'

In [56]:
df1.loc['implement', 'a']

'hoe'

In [57]:
# if loc and at, and iloc and iat do the same thing, what's the use of at and iat??

For more on what loc, iloc, at and iat can do, see this thread: https://stackoverflow.com/a/47098873
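On the question of why at and iat exist: they are scalar-only accessors intended for getting or setting a single value, whereas loc and iloc also accept lists and slices. A minimal sketch on a cut-down version of the table above (the smaller dataframe is an assumption made to keep the example short):

```python
import pandas as pd

df = pd.DataFrame({'a': ['obi', 1, 'hoe'], 'b': ['ada', 7, 'matchet']},
                  index=['name', 'id', 'implement'])

# at / iat: exactly one row and one column, returning a single value.
single_by_label = df.at['implement', 'a']
single_by_position = df.iat[2, 0]

# loc / iloc: also accept lists and slices, returning a Series or DataFrame.
several_rows = df.loc[['name', 'implement'], 'a']
block = df.iloc[0:2, :]

print(single_by_label, single_by_position)
print(several_rows)
print(block)
```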

### Summary and descriptive statistics: measure of central tendency, measure of dispersion, skewness and kurtosis, correlation and multicollinearity

Similar to NumPy, pandas has functions that provide descriptive statistics such as the measures of central tendency, dispersion, skewness and kurtosis, correlation and multicollinearity. Some of these are mode(), median(), mean(), sum(), std(), var(), skew(), kurt(), corr() and min(). The describe() function summarises the numeric columns in a dataframe, displaying the count, mean, standard deviation, minimum, quartiles (25%, 50%, 75%) and maximum values.
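A quick sketch of describe() alongside a few individual statistics, using the height values from the table above as a single numeric column:

```python
import pandas as pd

heights = pd.DataFrame({'height': [187, 192, 143, 155, 167]})

summary = heights.describe()   # count, mean, std, min, quartiles, max
print(summary)

print(heights['height'].median())
print(heights['height'].skew())
print(heights['height'].kurt())
```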

In [64]:
df1.loc['height'].max()

192

In [67]:
df1.loc['height'].kurt()

-2.119435211002499

# Pandas missing value

Pandas represents missing values as NA or NaN, which can be detected, removed and filled with functions like isnull(), notnull(), dropna(), fillna() and replace().

* check for missing values: .isnull()
* drop rows/columns with missing values: .dropna()
* convert a placeholder for missing values to NaN: .replace('?', np.nan) # an example
* fill missing values with some value: .fillna(). The fill value can be a constant or a statistic you compute yourself, such as the column mean or mode (most common value).

Remember to pass inplace=True (or assign the result back) if you want the change to apply to your original df.

Instead of fillna(), you can also use a library.
Example:

    from sklearn.impute import SimpleImputer
    df = SimpleImputer(missing_values=np.nan, strategy='mean').fit_transform(df)

The strategy can be set to, for example, the mean or the median of the column. Note that fit_transform() returns a NumPy array by default rather than a dataframe.
More on this later.
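The pandas steps listed above can be sketched on a tiny made-up dataframe, where '?' stands in for a missing entry. The column names and values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

raw = pd.DataFrame({
    'score': [10.0, np.nan, 30.0],
    'city': ['iju', 'lekki', '?'],
})

df = raw.replace('?', np.nan)        # turn the '?' placeholder into NaN
missing_counts = df.isnull().sum()   # number of missing values per column

# Fill each column with a sensible value: the mean for numbers, a label for text.
filled = df.fillna({'score': df['score'].mean(), 'city': 'unknown'})

# Or drop every row that still has a missing value.
dropped = df.dropna()

print(missing_counts)
print(filled)
print(dropped)
```

None of these calls use inplace=True, so `df` itself is unchanged and each result is held in a new variable.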


# Data Wrangling

The pandas library is vast enough to read data from and save to several file formats such as CSV, JSON, HTML and even databases.

* csv = pd.read_csv('sample_file.csv') #read file

* csv.to_csv('sample_file.csv', index=False) #write file

* excel = pd.read_excel('sample_excel.xlsx') #read file

* excel.to_excel('sample_file.xlsx') #write to excel

* tables = pd.read_html('http://www.webpage.com/data.html') #read tables from html webpage; returns a list of dataframes

* tables[0].to_html('sample_file.html') #write the first table to html

Note: for complex table imports, a tool such as Tabula or Tableau can extract tables from PDF files.

A dataframe can easily be split into segments based on given criteria using the groupby() function. It first splits the dataframe into groups, then applies a function to each group, and finally combines the results.

The first entry of each group in a groupby() can be displayed using **.first()**
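A minimal split-apply-combine sketch. The `sales` dataframe and its columns are made up for illustration.

```python
import pandas as pd

sales = pd.DataFrame({
    'city': ['iju', 'lekki', 'iju', 'lekki', 'agege'],
    'amount': [10, 20, 30, 40, 50],
})

# Split by city, apply sum() to each group, combine into one Series.
totals = sales.groupby('city')['amount'].sum()

# first() keeps the first row seen for each group.
firsts = sales.groupby('city').first()

print(totals)
print(firsts)
```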

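Merging in pandas is done with merge(), which supports different join types through the how parameter, for example pd.merge(left, right, how='inner') or pd.merge(left, right, how='outer').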

Concatenation is performed with the concat() function by combining series or dataframes while keeping the indices of the individual unit irrespective of duplicate indices.

In [69]:
# what is the diff btw merge() and concat()??
# fuel.duplicated().any()

#Merge function explanation and examples: https://www.shanelynn.ie/merge-join-dataframes-python-pandas-index-1/
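On the merge() versus concat() question: merge() joins two dataframes on matching key values, much like a SQL join, while concat() simply stacks dataframes along an axis and keeps each piece's own index. A small sketch with two made-up dataframes:

```python
import pandas as pd

left = pd.DataFrame({'id': [1, 2, 3], 'name': ['obi', 'ada', 'ikenna']})
right = pd.DataFrame({'id': [2, 3, 4], 'height': [192, 143, 155]})

# merge: match rows on the 'id' column.
inner = pd.merge(left, right, on='id', how='inner')   # only ids found in both
outer = pd.merge(left, right, on='id', how='outer')   # ids found in either, gaps become NaN

# concat: stack the rows; the original indices (0, 1, 2) are kept, so they repeat.
stacked = pd.concat([left, right])

print(inner)
print(outer)
print(stacked)
```

Passing ignore_index=True to concat() would renumber the stacked rows instead of repeating the original indices.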

## Data Visualisation and Representation

Anscombe's quartet shows that different datasets can have identical or nearly identical summary statistics, so that they look the same on paper, yet have very different distributions when graphed.
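As a rough check of this, the sketch below compares the y values of the first two datasets of the quartet using only the standard library. Both share the same x values, and their means and variances agree to two decimal places even though the individual points differ.

```python
from statistics import mean, variance

# x values shared by datasets I and II of Anscombe's quartet
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

stats_1 = (round(mean(y1), 2), round(variance(y1), 2))
stats_2 = (round(mean(y2), 2), round(variance(y2), 2))

print(stats_1)
print(stats_2)
print(y1 == y2)  # the raw values are not the same
```

Plotting y1 and y2 against x (for example with matplotlib) is what reveals the difference: the first looks roughly linear, the second follows a curve.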



# Web Scraping

##### library to generate user agent
import requests
from user_agent import generate_user_agent

#generate a user agent
headers = {'User-Agent': generate_user_agent(device_type="desktop", os=('mac', 'linux'))}

#headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux i686 on x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.63 Safari/537.36'}
#page_link is assumed to hold the URL of the page being fetched
page_response = requests.get(page_link, timeout=5, headers=headers)

##### request timeout

#timeout is set to 5 seconds

page_response = requests.get(page_link, timeout=5, headers=headers)


##### IP rotation

proxies = {'http' : 'http://10.10.0.0:0000',  
          'https': 'http://120.10.0.0:0000'}
          
page_response = requests.get(page_link, proxies=proxies, timeout=5) 

### Good web scraping reference (css selectors): 

https://www.learndatasci.com/tutorials/ultimate-guide-web-scraping-w-python-requests-and-beautifulsoup/

In [1]:
for i in range(1, 11):
    current_url = 'http://www.website.com/?p={page_num}'.format(page_num = i)
    print(current_url)

http://www.website.com/?p=1
http://www.website.com/?p=2
http://www.website.com/?p=3
http://www.website.com/?p=4
http://www.website.com/?p=5
http://www.website.com/?p=6
http://www.website.com/?p=7
http://www.website.com/?p=8
http://www.website.com/?p=9
http://www.website.com/?p=10


In [4]:
# Extract the text of every matching <address> element (soup is a BeautifulSoup object):

#entries = soup.find_all('address', {'class':'voffset-bottom-10'})
#locations = [e.get_text().strip() for e in entries]

In [5]:
'''
locations_list = []
entries = soup.find_all('address', {'class':'voffset-bottom-10'})
for entry in entries:
    locations_list.append(entry.get_text().strip())
locations_list[0]
'''

"\nlocations_list = []\nentries = soup.find_all('address', {'class':'voffset-bottom-10'})\nfor entry in entries:\n    locations_list.append(entry.get_text().strip())\nlocations_list[0]\n"

In [1]:
# feat = normalized.loc[:, ['Relative Compactness', 'Surface Area', 'Wall Area', 'Roof Area', 
#              'Overall Height', 'Orientation', 'Glazing Area', 'Glazing Area Distribution']]

In [4]:
man = [['boy', 'girl', 'male', 'female'], [1, 5, 4, 9], ['boy', 'girl', 'female', 'male']]
df = pd.DataFrame (man, columns = ['one', 'two', 'three', 'four'])
df

   one   two   three    four
0  boy  girl    male  female
1    1     5       4       9
2  boy  girl  female    male


In [5]:
df.loc[df.one=='boy', 'one'] = 0
df.loc[df.two=='girl', 'two'] = 1
df.loc[df.three=='male', 'three'] = 2
df.loc[df.three=='female', 'three'] = 3
df.loc[df.four== 'female', 'four'] = 3

In [6]:
df

   one  two  three  four
0    0    1      2     3
1    1    5      4     9
2    0    1      3  male


In [1]:
# Define a function which returns the features whose absolute correlation is above a
# threshold value (passed as an input parameter to our function).
# `correlations` is assumed to be a pandas Series of feature correlations defined earlier.

def get_features(correlation_threshold):
    abs_corrs = correlations.abs()
    high_correlations = abs_corrs[abs_corrs > correlation_threshold].index.values.tolist()

    return high_correlations
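The cell above relies on a `correlations` object that is not shown in these notes. A self-contained sketch, assuming it is the column of correlations between each feature and a target named 'quality' (the data below is made up), passes it in explicitly instead of reading a global:

```python
import pandas as pd

def get_features(correlations, correlation_threshold):
    """Return the labels whose absolute correlation exceeds the threshold."""
    abs_corrs = correlations.abs()
    return abs_corrs[abs_corrs > correlation_threshold].index.tolist()

# Made-up data: two features track the target closely, one does not.
data = pd.DataFrame({
    'alcohol': [1, 2, 3, 4, 5],
    'chlorides': [5, 4, 3, 2, 1],
    'noise': [2, 9, 1, 8, 3],
    'quality': [1, 2, 3, 4, 5],
})

# Correlation of every feature with the target, excluding the target itself.
correlations = data.corr()['quality'].drop('quality')
selected = get_features(correlations, 0.5)

print(correlations)
print(selected)
```

Taking the absolute value first means strongly negative correlations, such as 'chlorides' here, are selected as well.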

In [None]:
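# Display coefficients of each feature
# (coef_ is assumed to be the fitted model's coefficient array and features the matching column names)

coefficients = pd.DataFrame(coef_, features) 
coefficients.columns = ['Coefficient']
print(coefficients)

## These coefficients show the effect of each feature on the target ('quality') when all other features are held fixed.
## An increase of 1 in alcohol leads to an increase of 0.28 in quality.
## An increase of 1 in chlorides leads to a decrease of 1.34 in quality.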
## A 1 increase of 1 in chlorides will cause a decrease of 1.34 in Quality