## Python Libraries and Numpy

In this notebook, we will take a look at Python libraries in general and a specific library
that is useful in data science, called **numpy**.

A library is a collection of functions and tools that typically serve a related purpose. They
save you from having to write a lot of code yourself. In Python there are libraries, modules and
packages, which all serve the same general role but are technically different. However, for ease of
discussion, we will use the term library to cover all of these.

To use a library, you must first import it using the **import** keyword.
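
As a minimal sketch of the import forms you are likely to meet, the cell below uses `math` from Python's standard library. The choice of `math` and the alias `m` are only for illustration and are not used elsewhere in this notebook.

```python
# Three common ways to import, shown with the standard-library module 'math'
import math                  # refer to its contents as math.sqrt, math.pi, ...
import math as m             # the same module under a shorter alias
from math import sqrt, pi    # bring specific names directly into scope

print(math.sqrt(16))   # square root via the full module name
print(m.floor(2.7))    # rounding down via the alias
print(sqrt(2) * pi)    # the imported names used without any prefix
```

All three forms load the same module; they differ only in the name you type afterwards.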

For example, the **requests** library allows you to make HTTP requests over the web. This means that you can write code to fetch webpages and read their contents, as illustrated below. The requests library is a useful tool in web scraping, which we will discuss later in the course. Feel free to change the URL to something else.


In [48]:
# import the library requests and give it the name 'r'
import requests as r

# declare the string url, giving it the URL of the webpage you want to get
url = "http://www.bbc.co.uk"
# Get the webpage with the given URL and store it in the variable 'webpage'
webpage = r.get(url)
# Print the status code of the web request. A status code of 200 means OK.
print(webpage.status_code)
# Print the first 1000 characters of the webpage content
print(webpage.text[0:1000])

200
<!DOCTYPE html><html lang="en-GB" class="no-js"><head><meta charSet="utf-8" /><meta name="viewport" content="width=device-width, initial-scale=1" /><title data-rh="true">BBC - Home</title><meta data-rh="true" name="description" content="The best of the BBC, with the latest news and sport headlines, weather, TV &amp; radio highlights and much more from across the whole of BBC Online."/><meta data-rh="true" name="theme-color" content="#FFFFFF"/><meta data-rh="true" property="fb:admins" content="100004154058350"/><meta data-rh="true" property="og:description" content="The best of the BBC, with the latest news and sport headlines, weather, TV &amp; radio highlights and much more from across the whole of BBC Online."/><meta data-rh="true" property="og:image" content="https://static.files.bbci.co.uk/core/website/assets/static/bbc/images/metadata/poster-1024x576.efe9db7f43.png"/><meta data-rh="true" property="og:image:alt" content="BBC logo"/><meta data-rh="true" property="og:site_name" c

## Dir function

The **dir** function allows you to see all the functions, methods and attributes associated with a particular object or library. This list can be very long, but it is sometimes useful for checking what is available and for confirming how things are named. After running the code below, change the argument from 'webpage' to 'r' to see a list of the functions and attributes provided by the requests library itself.

In [49]:
# List the attributes and methods of the webpage object created above.
# You will see the attributes 'status_code' and 'text', which we
# used above.
dir(webpage)

['__attrs__',
 '__bool__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__enter__',
 '__eq__',
 '__exit__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__nonzero__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_content',
 '_content_consumed',
 '_next',
 'apparent_encoding',
 'close',
 'connection',
 'content',
 'cookies',
 'elapsed',
 'encoding',
 'headers',
 'history',
 'is_permanent_redirect',
 'is_redirect',
 'iter_content',
 'iter_lines',
 'json',
 'links',
 'next',
 'ok',
 'raise_for_status',
 'raw',
 'reason',
 'request',
 'status_code',
 'text',
 'url']
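
Names that start with an underscore are special or internal, so it can help to filter them out of a long `dir` listing. The sketch below does this for the standard-library `math` module, which stands in for any object or library; the variable name `public_names` is an assumption made for this example.

```python
import math

# Keep only the names that do not start with an underscore
public_names = [name for name in dir(math) if not name.startswith("_")]

print(len(public_names))    # how many public names the module exposes
print(public_names[:10])    # the first ten of them
```

The same list comprehension should work with `dir(webpage)` or `dir(r)` in place of `dir(math)`.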

## Numpy

The **numpy** library (pronounced num-pie) is a library of numerical functions and tools that are used in scientific computing, data science and machine learning.

Let's import it and use some of its features. We will start by looking at what you can do with numpy arrays, which are basically lists of numbers (or some other type of data).

In [59]:
# Import the numpy library and refer to it by the name 'np'
import numpy as np

# Create a numpy array from a list
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
# The shape of an array is its size along each dimension. This array has one dimension, 10 items long.
x.shape

(10,)
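
To see how the shape changes with more than one dimension, here is a small sketch with a two-dimensional array. The array `grid` and its values are made up for this example.

```python
import numpy as np

# A 2D array built from a list of lists: 2 rows, 3 columns
grid = np.array([[1, 2, 3],
                 [4, 5, 6]])

print(grid.shape)   # (2, 3): rows first, then columns
print(grid.ndim)    # 2: the number of dimensions
print(grid.size)    # 6: the total number of items
```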

In [60]:
# Calculate some statistics - average value, standard deviation, max value, min value
print(f"The mean (average) is {x.mean():.2f}")
print(f"The standard deviation is {x.std():.2f}")
print(f"The max value is {x.max():.2f}")
print(f"The min value is {x.min():.2f}")


The mean (average) is 5.50
The standard deviation is 2.87
The max value is 10.00
The min value is 1.00
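
One detail worth knowing: by default `std` computes the population standard deviation (it divides by the number of items). If you want the sample standard deviation instead, pass `ddof=1`. The sketch below recreates the same array under the hypothetical name `scores` so that it runs on its own.

```python
import numpy as np

# The same values as the array x above
scores = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

population_std = scores.std()        # divides by n (the default, ddof=0)
sample_std = scores.std(ddof=1)      # divides by n - 1

print(f"Population standard deviation: {population_std:.2f}")
print(f"Sample standard deviation: {sample_std:.2f}")
```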


In [61]:
# Increase every value by 1
print(x)
x = x+1
print(x)

[ 1  2  3  4  5  6  7  8  9 10]
[ 2  3  4  5  6  7  8  9 10 11]


In [62]:
# Multiply every value by 3 and add 1
print(x)
x = 3*x + 1
print(x)

[ 2  3  4  5  6  7  8  9 10 11]
[ 7 10 13 16 19 22 25 28 31 34]


In [63]:
# Take the square root of each number
print(np.sqrt(x))

[2.64575131 3.16227766 3.60555128 4.         4.35889894 4.69041576
 5.         5.29150262 5.56776436 5.83095189]
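
Arithmetic like this is applied to every element at once, which is different from how a plain Python list behaves. The following sketch contrasts the two; the names `values`, `arr` and `doubled_list` are made up for this example.

```python
import numpy as np

values = [1, 2, 3]        # a plain Python list
arr = np.array(values)    # the same numbers as a numpy array

# Multiplying a list repeats it; multiplying an array scales each element
print(values * 2)
print(arr * 2)

# With a plain list you need a loop or comprehension to scale each element
doubled_list = [v * 2 for v in values]
print(doubled_list)
```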


## Combining and Slicing arrays

You can combine two numpy arrays. You can also use indexing to reference parts of an array.


In [64]:
# Create two different numpy arrays called p and q
p = np.array([1, 2, 3, 4, 5])
q = np.array([0, 4, 5, 3, -1])

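# Add the elements of the arrays together
print(p+q)

# Combine the arrays by adding 2 times each value in the first array to 3 times the matching value in the second
print(2*p+3*q)

# Slice the array q so that we keep the first 3 elements.
# The result is called first_three so that it does not overwrite 'r', our alias for the requests library.
first_three = q[0:3]
print(first_three)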

[1 6 8 7 4]
[ 2 16 21 17  7]
[0 4 5]
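
Adding two arrays combines them element by element. If you instead want to join them end to end, `np.concatenate` is one way to do it. The sketch below also shows two further slicing patterns; the variable names `joined`, `last_two` and `every_other` are chosen for this example only.

```python
import numpy as np

p = np.array([1, 2, 3, 4, 5])
q = np.array([0, 4, 5, 3, -1])

# Join the two arrays end to end into one array of 10 items
joined = np.concatenate([p, q])
print(joined)

# Negative indices count from the end: keep the last two elements of q
last_two = q[-2:]
print(last_two)

# A third slice value is a step: keep every second element of q
every_other = q[::2]
print(every_other)
```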


## Loading data from a file

You can load data from a CSV (Comma-Separated Values) file and store it in a numpy array.

Let's load some test scores from the file 'test_scores.csv'. This file has 3 rows of data, one for each of 3 different tests
taken by a class of 12 students. The first test was out of 10, the second out of 20 and the third out of 100. Each column gives the scores for a particular student.

In [65]:
file = "test_scores.csv"
d = np.loadtxt(file, delimiter=",")
print(d)

[[ 1.  3.  3.  9.  1. 10.  9.  7.  7.  4. 10.  9.]
 [ 9. 20. 13.  7.  1. 13. 14.  1. 16. 18.  9.  4.]
 [97. 75.  5. 72. 76. 24. 12. 94. 67. 21. 85. 67.]]
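
If you do not have 'test_scores.csv' to hand, you can still try `np.loadtxt` by giving it a file-like object. The sketch below wraps a short string in `io.StringIO`; the contents of `csv_text` are a small made-up sample, not the real file.

```python
import io
import numpy as np

# A small made-up CSV: 3 rows (tests) by 3 columns (students)
csv_text = "1,3,3\n9,20,13\n97,75,5\n"

# loadtxt accepts a file name or a file-like object such as StringIO
sample = np.loadtxt(io.StringIO(csv_text), delimiter=",")

print(sample)
print(sample.shape)   # rows, columns
print(sample.dtype)   # loadtxt reads numbers as floats by default
```

This is why the scores printed above appear with a decimal point even though the file holds whole numbers.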


In [67]:
# The shape of the array d will tell you there are 3 rows each with 12 columns.
d.shape


(3, 12)

In [70]:
# We'll put each row in its own array, one for each test
test1 = d[0]
test2 = d[1]
test3 = d[2]
print(test1, test2, test3)

[ 1.  3.  3.  9.  1. 10.  9.  7.  7.  4. 10.  9.] [ 9. 20. 13.  7.  1. 13. 14.  1. 16. 18.  9.  4.] [97. 75.  5. 72. 76. 24. 12. 94. 67. 21. 85. 67.]


Let's do some data processing on these test scores. 

1. Firstly, we want to remove the last two students from each set of scores as they dropped the class.
2. Secondly, we want to rescale the scores for the first two tests so they are out of 100.
3. Thirdly, we will add the scores together so we get a total score for each student over the 3 tests.
4. Finally, we will divide each total by 3 so we get an average out of 100.


In [72]:
test1 = test1[0:10]
test2 = test2[0:10]
test3 = test3[0:10]
print(test1, test2, test3)

[ 1.  3.  3.  9.  1. 10.  9.  7.  7.  4.] [ 9. 20. 13.  7.  1. 13. 14.  1. 16. 18.] [97. 75.  5. 72. 76. 24. 12. 94. 67. 21.]


In [73]:
test1 = test1*10
test2 = test2*5
print(test1, test2, test3)

[ 10.  30.  30.  90.  10. 100.  90.  70.  70.  40.] [ 45. 100.  65.  35.   5.  65.  70.   5.  80.  90.] [97. 75.  5. 72. 76. 24. 12. 94. 67. 21.]


In [74]:
total = test1+test2+test3
print(total)

[152. 205. 100. 197.  91. 189. 172. 169. 217. 151.]


In [77]:
total_average = total/3
print(total_average)

[50.66666667 68.33333333 33.33333333 65.66666667 30.33333333 63.
 57.33333333 56.33333333 72.33333333 50.33333333]


In [80]:
# We don't like all the decimals, so we will round each score to 1 decimal place
tot_ave_rnd = np.around(total_average, decimals=1)
print(tot_ave_rnd)

[50.7 68.3 33.3 65.7 30.3 63.  57.3 56.3 72.3 50.3]


In [82]:
# Finally, let's work out the lowest, highest and mean of the average scores for the 10 students
print(tot_ave_rnd.min())
print(tot_ave_rnd.max())
print(tot_ave_rnd.mean())

30.3
72.3
54.75
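
The same four steps can also be carried out on the two-dimensional array directly, without splitting it into one array per test. This is only a sketch of an alternative approach: the scores are typed in from the output printed earlier so that the cell runs on its own, and the names `kept`, `scale`, `scaled` and `average` are chosen for this example.

```python
import numpy as np

# The test scores as printed earlier in this notebook (3 tests x 12 students)
d = np.array([
    [1, 3, 3, 9, 1, 10, 9, 7, 7, 4, 10, 9],
    [9, 20, 13, 7, 1, 13, 14, 1, 16, 18, 9, 4],
    [97, 75, 5, 72, 76, 24, 12, 94, 67, 21, 85, 67],
], dtype=float)

# Step 1: keep all rows, but only the first 10 columns (students)
kept = d[:, :10]

# Step 2: one multiplier per test, shaped (3, 1) so it applies row by row
scale = np.array([10, 5, 1]).reshape(3, 1)
scaled = kept * scale

# Step 3: add down the columns to get each student's total
total = scaled.sum(axis=0)

# Step 4: average over the 3 tests and round to 1 decimal place
average = np.around(total / 3, decimals=1)
print(average)
```

Here `axis=0` means the sum runs down each column, giving one total per student.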
