<img src="standard-vertical-black.png" width="150">

## School of Computer Science

# CS5901 Programming in Python
## Practical 2 
##### Credits: 30% of the module

## Aim / Learning objectives



The objectives of this practical are:

* Learn how to investigate, clean and prepare data
* Analyse the complexity of algorithms 


## Set-up

You are **only allowed** to use the following imported packages for this practical. No off-the-shelf machine learning packages such as _scikit-learn_ are allowed. 


In [2]:
import pandas as pd
import numpy as np
import time
import gc
import psutil
import os

### Data

- You will use the data in **P2data????.csv**, where **????** denotes the last four digits of your matriculation number. To obtain your file, email **cs5901.staff** 
- An existing dataset will be sampled and sent to you in CSV format

### Instructions

**Stage 0. Code repository**

- Code repos are an essential resource for the modern programmer
- Normally you'd push all your committed changes to a server, but we'll skip this step
- Everything below assumes use of *git*, but you can use any other suitable repository tool

0. Install *git* locally (or the repo of your choice)
    - https://git-scm.com/book/en/v2/Getting-Started-Installing-Git
1. Create a repo
    - https://git-scm.com/book/en/v2/Git-Basics-Getting-a-Git-Repository
2. Add your initial code to the repo, and commit changes as you update the code
    - https://www.atlassian.com/git/tutorials/saving-changes/git-commit
3. Include your git log in the notebook part of the submission
    - https://git-scm.com/docs/git-log

Again, don't *git push* anything if you are working locally

**Stage 1. Data cleaning and profiling**

0. Import the data as a pandas dataframe
1. Identify and remove any data rows that make no sense
2. Replace any missing values in T3 and T4 with average values **for their specific level**
3. Write code - without using *pandas.DataFrame.describe()* - that presents a table of descriptive statistics for each column, then compare your results to those obtained using *pandas.DataFrame.describe()*
4. Write code that identifies any repeated rows or confirms that there are none
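
The Stage 1 steps can be sketched roughly as follows. This is one possible approach on a toy dataframe, not a model answer: the column name `Level`, the toy values, and the helper names `fill_with_level_mean`, `manual_stats` and `find_repeated_rows` are all assumptions made for illustration, and your own CSV may be laid out differently.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the real column names and values come from your own CSV.
df = pd.DataFrame({
    "Level": ["A", "A", "A", "B", "B", "B", "B"],
    "T3": [10.0, np.nan, 14.0, 20.0, 22.0, np.nan, 20.0],
    "T4": [1.0, 3.0, np.nan, 5.0, 5.0, 7.0, 5.0],
})

def fill_with_level_mean(frame, level_col, value_cols):
    """Return a copy where NaNs in value_cols are replaced by the mean of their level."""
    out = frame.copy()
    for col in value_cols:
        # transform("mean") gives every row the mean of its own group (NaNs ignored).
        level_means = out.groupby(level_col)[col].transform("mean")
        out[col] = out[col].fillna(level_means)
    return out

def manual_stats(frame):
    """Descriptive statistics for numeric columns, built without describe()."""
    table = {}
    for col in frame.select_dtypes(include="number").columns:
        values = np.sort(frame[col].dropna().to_numpy(dtype=float))
        n = len(values)
        mean = values.sum() / n
        # Sample standard deviation (n - 1 in the denominator).
        std = np.sqrt(((values - mean) ** 2).sum() / (n - 1)) if n > 1 else np.nan
        # Median: middle value, or the average of the two middle values.
        median = 0.5 * (values[(n - 1) // 2] + values[n // 2])
        table[col] = {"count": n, "mean": mean, "std": std,
                      "min": values[0], "50%": median, "max": values[-1]}
    return pd.DataFrame(table)

def find_repeated_rows(frame):
    """Return every row that appears more than once (empty if there are none)."""
    return frame[frame.duplicated(keep=False)]

cleaned = fill_with_level_mean(df, "Level", ["T3", "T4"])
print(manual_stats(cleaned))
print(find_repeated_rows(cleaned))
```

The row labels in `manual_stats` are chosen to line up with the table that `describe()` prints, which makes a side-by-side comparison easier. The quartiles are left out here and would need their own interpolation rule.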

**Stage 2. Time and space complexity**

- The code below is an undocumented example of one way to find the CPU time and RAM space needed to perform calculations on randomly generated data, and investigate how the time and space needed grows as the size of the problem increases

1. Using adaptations of the code below, or the method of your choice, investigate the time and space needed for:
    - standard matrix multiplication (for time this should be $cn^3$ where $c$ is a positive constant)
    - sorting an unordered list of integers inefficiently by searching for the smallest element, then the next smallest, and so on
    - a comparison of Python's *find()* method and your loop-based implementation of a method that checks if one string is a substring of another

In [3]:
def myInv(size):
    intA = np.random.randint(-500,50000, (size,size))
    flA = np.random.rand(size,size)
    A = intA + flA
    b = np.random.randint(-10,100,size)
    return(np.linalg.inv(A))

process = psutil.Process(os.getpid())
baseRam = process.memory_info().rss

resTime = np.zeros(10)
resSpace = np.zeros(10)

for i in range(1,11):
    gc.collect()
    start = time.time()
    myInv(i*1000)
    end = time.time()
    ram = process.memory_info().rss    
    resTime[i-1] = end - start
    resSpace[i-1] = ram - baseRam
    
print(resTime)
print(resSpace)

[ 0.10660076  0.22539568  0.63430429  1.31847382  2.08841562  3.65023923
  5.11335135  7.91480541 10.14985824 14.21795225]
[ 5840896. 27615232. 54591488. 66453504. 77721600. 80125952. 83804160.
 84221952. 87863296. 91459584.]
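
One way to read the timings printed above is to fit a straight line on a log-log scale: if the time behaves like $cn^k$, then $\log(\text{time}) = \log c + k \log n$, so the slope of the line estimates $k$. The sketch below reuses the ten printed times and the sizes from the loop (`i*1000`). It is only an illustration: these are single runs on one machine, and the fitted exponent may differ from the textbook value.

```python
import numpy as np

# Sizes used by the loop above (i * 1000 for i = 1..10) and the times it printed.
sizes = np.arange(1, 11) * 1000
times = np.array([0.10660076, 0.22539568, 0.63430429, 1.31847382, 2.08841562,
                  3.65023923, 5.11335135, 7.91480541, 10.14985824, 14.21795225])

# Fit log(time) = intercept + slope * log(size); polyfit returns the slope first.
slope, intercept = np.polyfit(np.log(sizes), np.log(times), 1)

print(f"estimated exponent k: {slope:.2f}")
print(f"estimated constant c: {np.exp(intercept):.3e}")
```

The same fit can be applied to `resSpace`, although memory readings taken from the process as a whole tend to be noisier than timings.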


In [20]:
intA = np.random.randint(-500,50000, (2,2))
intA

array([[36732, 19066],
       [16203, 41399]])

In [21]:
flA = np.random.rand(2,2)
flA

array([[0.79731159, 0.34920528],
       [0.65719854, 0.34261068]])

In [22]:
flA + intA

array([[36732.79731159, 19066.34920528],
       [16203.65719854, 41399.34261068]])

In [24]:
b = np.random.randint(-10,100,2)
b

array([77, -6])
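
For the three Stage 2 investigations, the pure-Python routines being measured might look something like the sketch below. The function names (`matmul_loops`, `selection_sort`, `is_substring`, `time_call`) and the problem sizes are assumptions chosen for illustration. Only timing is shown; memory could be recorded with the same `psutil` pattern as in the cell above.

```python
import time
import numpy as np

def matmul_loops(A, B):
    """Standard product of two square matrices (lists of lists), three nested loops."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            total = 0
            for k in range(n):
                total += A[i][k] * B[k][j]
            C[i][j] = total
    return C

def selection_sort(values):
    """Sort by repeatedly swapping the smallest remaining element to the front."""
    items = list(values)  # work on a copy so the input is left unchanged
    for i in range(len(items)):
        smallest = i
        for j in range(i + 1, len(items)):
            if items[j] < items[smallest]:
                smallest = j
        items[i], items[smallest] = items[smallest], items[i]
    return items

def is_substring(needle, haystack):
    """Loop-based check of whether needle occurs anywhere in haystack."""
    n, m = len(haystack), len(needle)
    for start in range(n - m + 1):
        k = 0
        while k < m and haystack[start + k] == needle[k]:
            k += 1
        if k == m:
            return True
    return False

def time_call(func, *args):
    """Return the elapsed wall-clock seconds for a single call of func(*args)."""
    start = time.perf_counter()
    func(*args)
    return time.perf_counter() - start

# Time the sort on growing random inputs; doubling n should roughly quadruple the time.
rng = np.random.default_rng(0)
for n in (200, 400, 800):
    data = rng.integers(-500, 50000, n).tolist()
    print(n, time_call(selection_sort, data))
```

A single call per size gives noisy numbers, so repeating each measurement and averaging is usually worthwhile. For the substring comparison, `time_call(is_substring, needle, haystack)` can be set against `time_call(haystack.find, needle)` on the same inputs.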

## Key points

- All your code should be in fully documented *.py* files
- Your notebook should import and demonstrate the code
- For Stage 2, you need to understand what *myInv()* does
- For Stage 2, a scatter plot of time against size and/or space against size is often a good way to visualise the complexity
- For most tasks, there is no single correct answer - the idea is to show that you can write code that provides a better understanding of both data and the complexity of methods used in Data Science
- As for the first practical, the aim is to use the smallest number of external libraries in order to develop core Python skills
    - If you have problems generating inline plots in Jupyter without using *matplotlib*, then it is acceptable to also import this package.

## Submission

Upload two things via Moodle:

1. A single Python file containing all your documented code

2. A Jupyter notebook that imports your code, demonstrates its use, and presents any results or insights obtained. You should also discuss any design decisions you made in the notebook



## Assessment Criteria

Marking will follow the guidelines given in the School Student Handbook (see the link in the next section).



## Policies and Guidelines


### Marking
See the standard mark descriptors in the School Student Handbook:
http://info.cs.st-andrews.ac.uk/student-handbook/learning-teaching/feedback.html#Mark_Descriptors


### Good academic practice
The University policy on Good Academic Practice applies:
https://www.st-andrews.ac.uk/students/rules/academicpractice/
