# JBS Python Crash Course

In session 1, we'll try to get through the following:

- Basics: Storing data in variables, data types, basic variable manipulation

- Functions: What they are, how to make one, how to use one

- Modules: What they are, why they are, how to use one

- Useful modules: NumPy, pandas, scikit-learn

- Basics of 'machine learning', discussion, and a simple house-price model

In session 2, we'll start looking at the data you will be given for the actual assignment later on in the course.

*NOTE: In case you are curious, yes, this is still an IPython notebook 'cell', but I have converted it to a text-formatted cell by clicking 'Cell' -> 'Cell Type' -> 'Markdown'. Markdown is a simple text-formatting language.*

## OK let's look at some variables

In [None]:
# This line is ignored by the Python interpreter because it starts with a '#'

# let's create two variables, one called x and another called y
x = 2
y = 3
z1 = x + y
print(z1) 
# Congratulations, you just used your first function

In [None]:
# NOTE: Everything (apart from these comments) in Python is case sensitive 

# Question: What will happen now?
z2 = X + y

# Also note the error message; error messages are your friends

In [None]:
# In this cell, let's think of some other common mathematical operations to try








In [None]:
# There are quite a few different 'data types' in Python

# If you are used to Excel, you probably didn't have to worry about data types

# In Python you sometimes do have to worry, but not always, for example

my_variable1 = 1 # an integer

print(type(my_variable1))

my_variable2 = 1.0 # a floating point number

print(type(my_variable2))

my_variable1 + my_variable2

# Wonderful, it works. This is called type 'promotion': the int was promoted to
# a float so they could be added together

# also note that when we did print(type()) we were using 'nested' functions

In [None]:
# What about non-numerical data types? 

a_word = 'happy'

another_word = 'birthday, my friend.'

a_sentence = a_word + another_word

a_sentence

# Whoops, something is wrong with that sentence. How can we fix it?

In [None]:
# And, what about collections of things?

# Let's try the simplest type, a list, we can have a list of anything

some_numbers = [1, 2.0, 3, 4, 1000] # note they are mixed types
some_words = ["hey", "business", "like", "Boris Johnson"]

some_things = ["hey", "give me a number", 5, "5"]

print(some_things)

print(type(some_things[2]))

# What about this one? GUESS FIRST!
# print(type(some_things[3]))

# NOTE THAT LISTS ARE '0' INDEXED WHICH MEANS THEY START AT POSITION 0
# print(some_things[0])

# There are other 'collection' types in Python, can you name any?

In [None]:
# Hold on, I tried to create a variable and it didn't work?

# First, have a look at the 'Identifier Naming' section of this e-book https://python.swaroopch.com/basics.html

# Let's try an example that breaks the rules discussed in the above link:

a[variable = 5

In [None]:
# What about this one?

_aVario___AWrkm = 5

## Let's have another think about functions

In [None]:
# What does this do?

type(5)

In [None]:
# How about this?

print("hey")

In [None]:
# They are both FUNCTIONS. You can make your own functions, for example

def myfirstfunction(x, y):
    return (x + y)

# note that it can have as many arguments as you want (within reason) and they can be called whatever you want (within reason)

# OK enough chit-chat, let's try and use it

myfirstfunction(1, 2)

In [None]:
# Let's try something more complicated

def mysecondfunction(x, y, z):
    sum1 = x + y
    sum2 = y + z
    return sum1 + sum2

mysecondfunction(1, 1, 1)

In [None]:
# Note that everything 'within' the function has to have the same level of spaces or indentation
# that's why this doesn't work

def abadfunction(x, y, z):
    sum1 = x + y
  sum2 = y + z
      return sum1 + sum2

abadfunction(1, 1, 1)

In [None]:
# Also note, if you assign/change variables 'within' a function, it doesn't affect variables 'outside' of the function
# (there are caveats, but this will do for now)

z = 1.0

def doubler(x):
    z = 50000
    return 2*x

print(doubler(5))

# What will this print? GUESS FIRST
# print(z)

## OK, OK! That's the basics of functions covered. What about Modules?

Well, what if we have lots of functions? Hundreds of functions? And other more complicated components called 'classes'?

We might want to bundle them together in a neat way. A good way of doing this is by putting everything in a module.

We won't go through how to create a module today, but it is not particularly complicated, so if you are interested I recommend you have a look at the other self-learning links I emailed previously.

What we will go through today, however, is how to load in modules that other people have written. This is because we'll need some of these modules to complete some assignments later in the course.
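As a small taste of the idea before we get to numpy, here is a sketch using `math`, a module that ships with Python itself. It is not one of the modules we need for the course; it is only here to show how a module bundles related functions under one name.

```python
import math  # standard-library module, used here purely as an illustration

# functions inside a module are reached with the 'module.function' syntax
root = math.sqrt(16)
print(root)

rounded_down = math.floor(2.7)
print(rounded_down)

# dir() lists all the names that the module bundles together
names_in_math = dir(math)
print(len(names_in_math))
```

The same 'module.function' pattern is what we will use with numpy in the next cell.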

In [None]:
# There are different ways you can load in a module. Let's start by trying to load a module called 'numpy'

import numpy

# great we did it, wonderful. Let's try and use the numpy module. Let's use the function that converts a regular 'list'
# into a special type of collection known as a numpy array.

our_list = [1, 2, 3]

print(type(our_list))

print(our_list)

our_array = numpy.asarray(our_list)

print(type(our_array))

print(our_array)

# They might look the same when they are printed out, but notice their types are different. This is important.
# It has major implications for computational performance which we can discuss further if you want.

In [None]:
# OK good work. But typing 'numpy.' before we access any of the numpy functions can get a bit cumbersome.
# We can improve the situation by giving numpy an alias. We do this in the following way:

import numpy as np

our_list2 = [1, 2, 3]
our_array2 = np.asarray(our_list2) # notice the difference from the previous cell
print(type(our_array2))
print(our_array2)

# See, that's a bit better isn't it?

In [None]:
# What if we know we only need one or two functions from a module? We can import them directly.

from numpy import asarray

our_list3 = [1, 2, 3]
our_array3 = asarray(our_list3) # notice the difference from the previous cell
print(type(our_array3))
print(our_array3)

In [None]:
# And finally (not normally recommended, because imported names can silently replace names you already use),
# what if we want to import all the functions from a module so that we can use them without any prefix?

from numpy import *

our_list4 = [1, 2, 3]
our_array4 = asarray(our_list4) # SAME AS PREVIOUS CELL, BUT NOW WE'VE IMPORTED ALL OF THE NUMPY FUNCTIONS
print(type(our_array4))
print(our_array4)


## OK great. Now we can load modules. We've introduced numpy, let's dig a bit deeper.

numpy will be useful to us in this course. So will some other modules. The other ones I can think of off the top of my head are 'pandas' and 'scikit-learn'. So what do these modules provide?

numpy has the basic array functionality that we will depend on for computational speed.

pandas is key to many data-centric applications. It will enable you to load in data very quickly and manipulate the data with ease.

scikit-learn is a machine learning module for python. It will be particularly important for the assignment you will eventually be asked to do for the course.
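To make the first two descriptions a little more concrete, here is a minimal sketch. The numbers are made up for illustration; only the column names 'Day Purchased' and 'Price' are borrowed from the house-price file we load later.

```python
import numpy as np
import pandas as pd

prices_list = [100.0, 200.0, 300.0]  # made-up numbers, for illustration only

# with a plain list we have to visit each element ourselves
doubled_list = [2 * p for p in prices_list]

# with a numpy array the operation is applied to every element at once
prices_array = np.asarray(prices_list)
doubled_array = 2 * prices_array

print(doubled_list)
print(doubled_array)

# pandas puts labelled columns on top of arrays like these
table = pd.DataFrame({"Day Purchased": [1, 2, 3], "Price": prices_list})
mean_price = table["Price"].mean()
print(mean_price)
```

On three numbers the speed difference is invisible, but on millions of rows the array version is typically much faster than the loop.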

### In this section we will briefly show some numpy and pandas functionality. Elements of scikit-learn will be introduced later as needed.

In [None]:
# let's start from where we left off, with our array
# there are lots of numpy functions we can use on our array (see https://numpy.org/doc/1.17/reference/index.html)

print(np.square(our_array)) # square each element

print(np.sum(our_array)) # sum all elements

print(np.flip(our_array, 0)) # flip the order of elements along the 0th dimension

print(np.linalg.norm(our_array)) # calculate the norm, (1**2 + 2**2 + 3**2)**(1/2)

# the above are just the simplest numpy functions I could think of. I encourage you to browse the documentation to see
# what is possible.

In [None]:
# let's use pandas to import some tabular data. If you have Excel (or an open-source equivalent), let's open it there first
# so you can get a visual understanding of the data. The file is 'housepricedata.csv'. Now, let's also open it up in a
# basic text editor (e.g. Notepad) and see what we can say about that.

# finally! Let's use pandas

import pandas as pd

housepricedata = pd.read_csv("housepricedata.csv") # this function just reads the file if we give it the filename

# wonderful, now let's have a look at it here
print(housepricedata)

In [None]:
# OK, is there any irrelevant information here? There doesn't seem to be a correlation between estate agent and price,
# or number of teapots in the house and house price. Maybe we can get rid of those two columns?

del housepricedata['Estate Agent']
del housepricedata['Number of Teapots']

print(housepricedata) # that looks a bit more sensible!

In [None]:
# OK, as we only have two dimensions here, what's a sensible thing to do? Well, humans can easily visualise
# 3 dimensions or fewer. Why don't we try and plot the data? When doing data analysis, we don't always have this luxury.
# Make the most of it.

# tell plotter to show the plot below the cell. (NOTE THIS COMMAND ONLY WORKS IN IPYTHON NOTEBOOKS)
%matplotlib inline 

# plot the data
housepricedata.plot(x='Day Purchased', y='Price', kind="scatter")

In [None]:
# Hmm, what does this look like to you? It looks roughly like a straight line to me.

# So let's try and fit a straight line regression model using scikit-learn.
# Relevant documentation https://scikit-learn.org/stable/modules/linear_model.html#ordinary-least-squares
# READ THE DOCUMENTATION! IT IS YOUR FRIEND! READ THE DOCUMENTATION! IT IS YOUR FRIEND! READ THE DOCUMENTATION...

# import a linear model from scikit-learn
from sklearn import linear_model

# create the model
reg = linear_model.LinearRegression()

# separate inputs and outputs for clarity
inputs = np.asarray(housepricedata['Day Purchased'])
# when we get the data it is a 1-D array (a flat sequence of n values)
print(inputs)
# but scikit-learn expects a 2-D array: n rows, a single column
inputs = inputs.reshape(-1, 1)
print(inputs)

outputs = np.asarray(housepricedata['Price'])

# fit the model to the data
reg.fit(inputs, outputs)

# model is now fitted, let's see how well it fits the data
predicted = reg.predict(inputs)

# we don't have nice dataframes anymore, just separate arrays, so we need to import our plotting module explicitly
import matplotlib.pyplot as plt

plt.plot(inputs, predicted)
plt.plot(inputs, outputs, marker='o', linewidth=0)

# congratulations, you are now a machine learning practitioner!

## The real challenge! Let's find some fraud!

Now we have the basic skills we need to attempt a more interesting problem: detecting bank fraud from anonymised data.

Let's have a go. The training data is in the file called 'd_training_set.csv'. Now there are many dimensions to the inputs. 

Not so easy to visualise.

Another key difference is that the output is binary. Either the transaction is fraudulent or it's not. This is known as a 'classification' problem.
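To see what 'binary output' means in practice, here is a minimal sketch on a tiny made-up dataset (not the course data). It assumes a single input column and labels of 0 (genuine) or 1 (fraudulent); the variable names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# made-up data: one input column, low values are genuine (0), high values fraudulent (1)
toy_inputs = np.array([0.0, 1.0, 2.0, 3.0, 7.0, 8.0, 9.0, 10.0]).reshape(-1, 1)
toy_labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])

toy_model = LogisticRegression(random_state=0, solver='lbfgs')
toy_model.fit(toy_inputs, toy_labels)

# unlike the house-price model, every prediction is one of the two classes
toy_predictions = toy_model.predict(np.array([[1.5], [8.5]]))
print(toy_predictions)
```

Compare this with the regression model earlier, which could return any number as a predicted price.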

As before, let's look at the data in Excel before we load it into Python.

In [None]:
# GAPS LEFT IN BETWEEN LINES HERE SO YOU CAN TAKE NOTES AS WE DISCUSS

from sklearn.linear_model import LogisticRegression

traindata = pd.read_csv("d_training_set.csv")

Y = traindata['Class']

del traindata['Class']

del traindata['row_id']

# (NOTE TO DISCUSS RANDOM_STATE AND SOLVER)
model = LogisticRegression(random_state=0, solver='lbfgs')

fittedmodel = model.fit(traindata, Y)

In [None]:
# Now we have fitted our model, we need to make predictions on 'unseen data' whose outputs we don't know

testdata = pd.read_csv("d_online_test_set.csv")

rowids = np.array( testdata['row_id'] )

del testdata['row_id']

predictions = fittedmodel.predict(testdata)

In [None]:
# Now we have our predictions, we just need to do some data wrangling to get them in a suitable format for submission
# (two columns, many rows, with the two values on each row separated by a comma)

towrite = np.column_stack((rowids, predictions))

np.savetxt("predicted.csv", towrite, delimiter=',', fmt='%d')

## Well done! 

Now we can open up our 'predicted.csv' file (WordPad is best on Windows, not sure on Mac) and copy and paste our predictions into the online form at

https://concertotest.com/david/demo/test/fraudprediction

I get an accuracy of approximately 90%. Not too bad for a quick attempt.
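For reference, here is a sketch of how an accuracy figure like this can be computed, assuming 'accuracy' means the fraction of predictions that match the true labels (I am assuming the online form does something similar). The labels below are made up for illustration.

```python
import numpy as np

# made-up labels, purely for illustration (1 = fraudulent, 0 = genuine)
true_labels = np.array([0, 0, 1, 0, 1, 0, 0, 0, 1, 0])
predicted_labels = np.array([0, 0, 1, 0, 0, 0, 0, 0, 1, 0])

# comparing the arrays gives True/False per row; the mean is the fraction of matches
accuracy = np.mean(true_labels == predicted_labels)
print(accuracy)
```

Here 9 of the 10 made-up predictions match, so the accuracy is 0.9.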

Your job later in the course will be to try and develop a model that achieves a greater accuracy. Note that your approach doesn't strictly have to involve Python alone.

(Hint: you will need to look at more advanced models in scikit-learn, but to start with, consider which of the input columns are more or less correlated with the outputs. Perhaps some columns do not correlate with the classification at all, and perhaps some inputs are correlated with each other, presenting duplicate information to the model. This is just a starting point. I haven't explored further, so I cannot say whether the above is true or not; it is up to you to investigate.)
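One possible way to start on the hint is the pandas `corr()` function, which gives the correlation between every pair of columns. The sketch below uses a tiny made-up table; the `feature_*` column names are hypothetical and only 'Class' is borrowed from the training file.

```python
import pandas as pd

# made-up table, for illustration only
toy = pd.DataFrame({
    "feature_a": [1, 2, 3, 4, 5, 6],
    "feature_b": [2, 4, 6, 8, 10, 12],  # exactly 2 * feature_a, so duplicate information
    "feature_c": [5, 1, 4, 2, 6, 3],
    "Class": [0, 0, 0, 1, 1, 1],
})

correlations = toy.corr()

# how strongly does each column correlate with the output?
print(correlations["Class"])

# are two of the inputs telling the model the same thing?
a_vs_b = correlations.loc["feature_a", "feature_b"]
print(a_vs_b)
```

A value near 1 or -1 between two inputs suggests they carry much the same information; a value near 0 against 'Class' suggests a column may not help a linear model much.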