# Introduction to Python <a class="anchor" id="top"></a>

**Simon Hall**

This notebook is an introduction to the world of Python. We'll begin with a review of data types, operators, functions, and then move to the main data structures used in Python: *tuples*, *lists*, *dictionaries* and *sets*. We will also look at more advanced features, such as lambdas and list comprehensions.

## Contents:
* [1. Data Types](#first-bullet)
     * [Numbers](#numbers-bullet)
     * [Strings](#strings-bullet)
     * [Booleans](#bools-bullet)
     * [None](#none-bullet)
     * [Converting Types](#conv-bullet)
* [2. Operators](#second-bullet)
* [3. Functions](#third-bullet)
     * [Chaining Functions](#chains-bullet)
     * [Mapping Functions](#maps-bullet)
     * [Lambdas](#lambdas-bullet)
* [4. IF Statements and Loops](#fourth-bullet)
     * [IF Statements](#ifs-bullet)
     * [FOR Loops](#fors-bullet)
     * [WHILE Loops](#whiles-bullet)
* [5. Data Structures](#fifth-bullet)
     * [Tuples](#tuples-bullet)
     * [Lists](#lists-bullet)
     * [Dictionaries](#dicts-bullet)
     * [Sets](#sets-bullet)
* [6. NumPy and Pandas](#sixth-bullet)
     * [NumPy](#numpy-bullet)
     * [Pandas](#pandas-bullet)
* [7. Classes](#seventh-bullet)
* [8. Web Scraping](#eighth-bullet)

## 1. Data Types <a class="anchor" id="first-bullet"></a>

[TOP ↑](#top)

Python is an object-oriented language, so data is typically stored and manipulated in objects or structures. The data contains values which have their own **type**, which tells us whether that value is a number, a string, or a boolean. You can check the type of a variable using the **type()** function.

### Numbers <a class="anchor" id="numbers-bullet"></a>

[TOP ↑](#top)

There are two main number types: *integer* and *float*. (Python also has a built-in *complex* type, which is not covered here.)

In [2]:
# The integer type is referred to as int
count = 8
type(count)

int

In [2]:
# All decimals are float type
distance = 5
time = 8
velocity = distance / time
type(velocity)

float

In [3]:
print(velocity)

0.625


### Strings <a class="anchor" id="strings-bullet"></a>

[TOP ↑](#top)

A *string* is a character or text type.

In [6]:
# The string type is referred to as str
first_name = "Simon"
type(first_name)

str

In [8]:
# You can use single or double quotes
surname = 'Hall'
type(surname)

str

We can output strings using the **print()** function. **print()** converts whatever it is given to text, but joining a string to another value with **+** only works if both values are strings. To place other variables inside a string for output, we can use formatted string literals, or f-strings, which are created by putting the letter f just before the opening quote. Inside an f-string, a variable placed in curly brackets is converted to a string and inserted at that position.

In [1]:
# Formerly, without using f-strings, we could do this:
name = "Simon"
greeting = "Hello " + name
print(greeting)

# Concatenation with + needs a string on both sides, so we often have to convert using str()
answer = 42
line = "The answer to life, the universe and everything is " + str(answer)
print(line)

# The easiest method is to use an f-string
line = f"The answer to life, the universe and everything is {answer}"
print(line)

Hello Simon
The answer to life, the universe and everything is 42
The answer to life, the universe and everything is 42
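
As a small sketch of what else can go inside the curly brackets: an f-string accepts any expression, and an optional format specifier after a colon controls how the value is displayed. The variable names and values below are illustrative and are not used elsewhere in this notebook.

```python
# Illustrative values (assumed for this example only)
distance = 10
time_taken = 4

# Any expression can appear inside the curly brackets
summary = f"{distance} km in {time_taken} h is {distance / time_taken} km/h"
print(summary)

# A format specifier after the colon controls the display, here two decimal places
rounded = f"Velocity: {distance / time_taken:.2f} km/h"
print(rounded)
```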


### Booleans <a class="anchor" id="bools-bullet"></a>

[TOP ↑](#top)

A *boolean* is a logical True/False data type.

In [2]:
# The boolean type is referred to as bool
prime = True
type(prime)

bool

### None <a class="anchor" id="none-bullet"></a>

[TOP ↑](#top)

The ```None``` keyword is used to define a null value, or no value at all. It is the only value of a special data type in Python, called ```NoneType```. It is not equivalent to 0, False, or an empty string; it's its own thing.

In [1]:
#None is its own data type
x = None
type(x)

NoneType
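
The cell below is a minimal sketch of how ```None``` compares with the values it is often confused with. The variable names are chosen for illustration only.

```python
x = None

# None is not equal to 0, False, or an empty string
equals_zero = (x == 0)
equals_false = (x == False)
equals_empty = (x == "")
print(equals_zero, equals_false, equals_empty)

# The idiomatic way to test for None is the identity operator "is"
is_missing = x is None
print(is_missing)

# None is "falsy": it behaves like False in a condition,
# even though it is not equal to False
label = "has a value" if x else "no value"
print(label)
```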

### Converting Types <a class="anchor" id="conv-bullet"></a>

[TOP ↑](#top)

Each data type has a corresponding function which converts a value to that type, so long as there's a sensible way to do so. Python will sometimes convert between types for us automatically. As we have seen, for example, dividing two *ints* with **/** produces a *float*, even when the result is a whole number. The principal functions are **int()**, **bool()**, **float()** and **str()**.

In [5]:
# The int() function converts a value to an integer.
# Note that this function discards the decimal part (it truncates towards zero); it doesn't round

population_ireland = 4.9

population_ireland_int = int(population_ireland)

print(f"The population of Ireland is approximately {population_ireland_int} million?")

population_rounded = round(4.9)
print(f"Actually, the population of Ireland is approximately {population_rounded} million.")

The population of Ireland is approximately 4 million?
Actually, the population of Ireland is approximately 5 million.


In [6]:
# bool is a subtype of int, so bool values behave like ints in arithmetic: False is 0 and True is 1.

coin1_heads = True
coin2_heads = True
coin3_heads = False

total_heads = coin1_heads + coin2_heads + coin3_heads

print(f"In total we flipped {total_heads} heads.")

In total we flipped 2 heads.


In [8]:
# Strings can also be converted to numbers

sides_of_triangle_string = "3"
print(f"sides_of_triangle_string is of data type: {type(sides_of_triangle_string)}")

sides_of_triangle = int(sides_of_triangle_string)
print(f"sides_of_triangle is of data type: {type(sides_of_triangle)}")

sides_of_triangle_string is of data type: <class 'str'>
sides_of_triangle is of data type: <class 'int'>


You can also use the **bool()**, **float()** and **str()** functions to convert to the other types.
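
A short sketch of those other conversions follows, including what happens when there is no sensible way to convert. The values are made up for illustration.

```python
# float() parses numeric strings and converts ints
price = float("3.50")
whole = float(7)
print(price, whole)

# str() produces the text form of a value
as_text = str(2.5)
print(as_text)

# bool() treats zero, empty strings and None as False
empty_is_false = bool("")
zero_is_false = bool(0)
# Any non-empty string is True, even the text "False"
text_is_true = bool("False")
print(empty_is_false, zero_is_false, text_is_true)

# A conversion with no sensible result raises a ValueError
conversion_failed = False
try:
    int("three")
except ValueError as error:
    conversion_failed = True
    print(f"Could not convert: {error}")
```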

## 2. Operators <a class="anchor" id="second-bullet"></a>

[TOP ↑](#top)

Operations are mappings or functions that act on elements to produce other elements. Operators are symbols which represent these operations. They take one or more values and produce some result. The most common operators are things like **+**, **-**, **&ast;** and **/** which add, subtract, multiply and divide values.

In [5]:
arithmetic_addition = 1 + 1
string_addition = "1" + "1"

print(f"When we're working with numbers, 1 + 1 = {1 + 1}.")
print(f"When we're working with strings, 1 + 1 = {'1' + '1'}!")
print("If we mix a string and a number we get a TypeError, so we convert the string first:")
print(1 + int("1"))

When we're working with numbers, 1 + 1 = 2.
When we're working with strings, 1 + 1 = 11!
If we mix a string and a number we get a TypeError, so we convert the string first:
2


Python also supports comparison operators. The most common comparison operators are **&gt;**, **&lt;**, **&gt;=**, **&lt;=**, **==**, and **!=**. The one that trips people up most often is the double equals sign: **==** checks for equality, whereas a single **=** assigns a value. **!=** is the opposite of **==**, returning True if the operands are different.

In [74]:
print(f"5 > 7 : {5 > 7}")
print(f"6 >= 6 : {6 >= 6}")
print(f"14 <= 12 : {14 <= 12}")
print(f"5 == 5 : {5 == 5}")
print(f"5 != 5 : {5 != 5}")
print(f"8 == '8' : {8 == '8'}")

5 > 7 : False
6 >= 6 : True
14 <= 12 : False
5 == 5 : True
5 != 5 : False
8 == '8' : False


Boolean operators allow us to combine two boolean values using **and**/**or**. Unlike many programming languages, which use double ampersands, &amp;&amp;, and double vertical bars, ||, for these operators, Python uses the plain English words **and** and **or**. We can use the keyword **not** to flip a bool from True to False or False to True.

In [15]:
t = True
f = False

print(f"*and* returns true if both operands are true: t and f is {t and f}, t and t is {t and t}")
print(f"*or* returns true if either operand is true: t or f is {t or f}, f or not t is {f or not t}")

*and* returns true if both operands are true: t and f is False, t and t is True
*or* returns true if either operand is true: t or f is True, f or not t is False


### Modulo Arithmetic

The modulo operator, **&#37;**, divides the first number by the second and returns the remainder. It's commonly used to check whether one number divides evenly into another.

In [9]:
# 7/5 = 1 with remainder 2 

remainder = 7%5
print(remainder)

2


In [16]:
print(f"5 divided by 2 is {int(5 / 2)} with {5 % 2} left over")

# If a % b is 0 then b divides into a evenly

for number in [1, 3, 4, 5, 12]:
    if number % 2 == 0:
        print(f"{number} is even")
    else:
        print(f"{number} is odd")

5 divided by 2 is 2 with 1 left over
1 is odd
3 is odd
4 is even
5 is odd
12 is even
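
The cell above uses int(5 / 2) to get the whole-number part of a division. As a brief sketch of an alternative, Python also has a floor division operator, **//**, and a built-in divmod() function which returns the quotient and the remainder together.

```python
# Floor division gives the whole-number part, modulo gives the remainder
quotient = 7 // 5
remainder = 7 % 5
print(f"7 divided by 5 is {quotient} with remainder {remainder}")

# divmod() returns both values at once
q, r = divmod(17, 5)
print(q, r)

# The two pieces always recombine to give the original number
assert q * 5 + r == 17
```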


## 3. Functions <a class="anchor" id="third-bullet"></a>

[TOP ↑](#top)

A function is a re-usable block of code. We need to *define* our function using the **def** keyword, followed by the name of the function. If the function is going to take any inputs, we also need to give them names, which are listed in parentheses after the function name. Functions are useful because they cut down the amount of code we have to write, make our code more readable, are easy to re-use in other scripts, and make it easier to test and debug our code.

In [17]:
# We can create a function which tells us whether a number is even or not

def isEven(number):
    if number % 2 == 0:
        return True
    else:
        return False
    
print("Before we can use a function we need to define it")
print("When we want to use a function we need to call it")
print(f"2 is even? {isEven(2)}")
print(f"3 is even? {isEven(3)}")

Before we can use a function we need to define it
When we want to use a function we need to call it
2 is even? True
3 is even? False


In [18]:
# We can create another function which takes two inputs and tells us which is largest
# If both are the same it returns "None", the Python equivalent of null

def theLargerOf(first, second):
    if first > second:
        return first
    elif second > first:
        return second
    else:
        return None
    
print(f"The larger number of 2 and 12 is {theLargerOf(2, 12)}")
print(f"The larger number of 8 and 3 is {theLargerOf(8, 3)}")
print(f"The larger number of 6 and 6 is {theLargerOf(6, 6)}")

# This will become 14 + 5
theLargerOf(6, 14) + theLargerOf(3, 5)

The larger number of 2 and 12 is 12
The larger number of 8 and 3 is 8
The larger number of 6 and 6 is None


19

### Chaining Functions <a class="anchor" id="chains-bullet"></a>

[TOP ↑](#top)

Pythonic code is terse. Hence, chaining functions is very common.

In [2]:
def double(number):
    return number * 2

x = 1
x = double(x)
x = double(x)
x

4

This code assigns a value of 1 to x and then doubles it twice (quadruples it). We can put all of this on a single line using function chaining.

In [6]:
x = double(double(1))
x

4

The Python interpreter starts from the innermost function call and works its way out. Behind the scenes, this line of code unfolds like this:

```python
x = double(double(1))
# First the interpreter evaluates double(1) (which gives 2)
x = double(2)
# Then the interpreter evaluates double(2) (which gives 4)
x = 4
# Finally it assigns the value 4 to the variable x
```

### Mapping Functions <a class="anchor" id="maps-bullet"></a>

[TOP ↑](#top)

It's very common in Python to want to apply a function to every element in a list. Lists and other compound data structures will be covered shortly. We come up against this kind of situation when we're normalising data, for example. We could do it using a loop, but it's not very readable and often not very efficient.

In [19]:
# Create a list which doubles the value of every item in numbers
numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

def double(number):
    return number * 2

# Create a new list to hold our doubled numbers
doubles = []
for number in numbers:
    doubled_number = double(number)
    doubles.append(doubled_number)
    
# Jupyter Notebook prints the last line of a cell.
doubles

[2, 4, 6, 8, 10, 12, 14, 16, 18, 20]

This works just fine, but it's quite a lot of code. It's difficult to tell exactly what's going on without reading through the code line by line. Python's **map()** function allows us to take a function and apply it to every value in a list. Using the **map()** function we can reduce the number of lines of code significantly.

In [10]:
numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

def double(number):
    return number * 2

# The first parameter to map is the name of the function we want to apply
# The second parameter is a list of values to apply it to
doubles = map(double, numbers)

In [11]:
type(doubles)

map

In [12]:
# We need to convert the output of map() back to a list to print it
list(doubles)

[2, 4, 6, 8, 10, 12, 14, 16, 18, 20]

In the example above we tell the Python interpreter that we'd like to run every number in the numbers list through the *double()* function. Rather than building a list straight away, **map()** returns a *map* object, an iterator which produces the output of the function for each item on demand. That is why we need to convert it to a list using the **list()** function when we want to print the output.
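
One consequence of map() returning an iterator is that its values can only be consumed once. The cell below is a minimal sketch of that behaviour, re-using the double() function from above.

```python
def double(number):
    return number * 2

doubles = map(double, [1, 2, 3])

# Nothing has been computed yet; values are produced on demand
first = next(doubles)
print(first)

# Converting to a list consumes the remaining values
remaining = list(doubles)
print(remaining)

# The iterator is now exhausted, so a second conversion gives an empty list
second_pass = list(doubles)
print(second_pass)
```

If the results are needed more than once, it's simplest to convert to a list straight away and keep the list.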

### Lambdas <a class="anchor" id="lambdas-bullet"></a>

[TOP ↑](#top)

We can reduce the lines of code here even further using a **lambda**. A lambda is a one-line anonymous function. It gives us a short-hand way to write a function. It has the following syntax:

```python
lambda x: f(x)
```

The following are equivalent.

In [None]:
def double(number):
    return number * 2

lambda number: number * 2

You can see above that a lambda leaves out the **def** and **return** keywords. It doesn't have a name, so all we have to write is the name of the parameter and the expression we'd like to return. We use the keyword **lambda** to tell the interpreter this is a lambda function. By convention, people tend to use **x** as the parameter name when writing a lambda. Putting it all together, we have:

In [21]:
numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

doubles = map(lambda x: x * 2, numbers)

# We need to convert the output of map() back to a list to print it
list(doubles)

[2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
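
Lambdas are not limited to map(). As a sketch, the cell below passes a lambda as the key argument of the built-in sorted() function and shows a lambda with two parameters. The word list and the name add are made up for illustration.

```python
words = ["banana", "fig", "cherry"]

# The key function is called on each item to decide the sort order,
# here the length of the word
by_length = sorted(words, key=lambda word: len(word))
print(by_length)

# A lambda can take more than one parameter.
# Assigning it to a name works, although def is usually clearer for named functions.
add = lambda a, b: a + b
total = add(2, 3)
print(total)
```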

## 4. IF Statements and Loops <a class="anchor" id="fourth-bullet"></a>

[TOP ↑](#top)

As with most programming languages, Python supports the use of conditional ```if``` statements utilising the comparison operators and logical conditioning, as well as the ubiquitous ```for``` and ```while``` loops.

### IF Statements <a class="anchor" id="ifs-bullet"></a>

[TOP ↑](#top)

In Python, an ```if``` statement has the following syntax: 

```python
if condition:
    do something
```

We can also use the ```elif``` (else if) and ```else``` keywords to include other conditions.
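
A runnable sketch of the ```if```/```elif```/```else``` pattern is shown below. The function name and the temperature thresholds are arbitrary choices for illustration.

```python
def describe_temperature(celsius):
    # Conditions are checked from top to bottom; the first true one wins
    if celsius < 0:
        return "freezing"
    elif celsius < 15:
        return "cold"
    elif celsius < 25:
        return "mild"
    else:
        return "hot"

for reading in [-3, 10, 20, 31]:
    print(f"{reading} degrees is {describe_temperature(reading)}")
```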

### FOR Loops <a class="anchor" id="fors-bullet"></a>

[TOP ↑](#top)

In Python, a ```for``` loop is used for iterating over compound data structures (covered in the next section), either a list, a tuple, a dictionary, a set, or a string, and has the following syntax: 

```python
for variable_name in list:
    do something
```

We can choose any variable name we like in a for loop, though it's best to make it descriptive.
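
The cell below sketches a few common ways of using a ```for``` loop: over a list, over a list together with its indices using enumerate(), over a string, and over a range of numbers. The data is illustrative.

```python
colours = ["red", "green", "blue"]

# Iterating over a list
for colour in colours:
    print(colour)

# enumerate() supplies the index alongside each item
indexed = []
for index, colour in enumerate(colours):
    indexed.append(f"{index}: {colour}")
print(indexed)

# Iterating over a string visits one character at a time
letters = []
for letter in "abc":
    letters.append(letter)
print(letters)

# range(1, 5) produces 1, 2, 3, 4 (the end value is excluded)
total = 0
for number in range(1, 5):
    total += number
print(total)
```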

### WHILE Loops <a class="anchor" id="whiles-bullet"></a>

[TOP ↑](#top)

In Python, a ```while``` loop can execute a set of statements as long as some condition holds true, and has the following syntax: 

```python
while condition:
    do something
```

We can also use the keyword ```break``` to break out from a loop even while the condition remains true, ```continue``` to interrupt the current iteration and move to the next one, and ```else``` to run a block of code once the condition is no longer true (the ```else``` block is skipped if the loop ends with ```break```).
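
As a sketch of these keywords in action, the first loop below collects odd numbers and stops early with ```break```, so its ```else``` block does not run. The second loop ends normally, so its ```else``` block does run. The variable names are illustrative.

```python
count = 0
odd_numbers = []
else_ran_after_break = False

while count < 10:
    count += 1
    if count % 2 == 0:
        continue            # skip even numbers and go to the next iteration
    if count > 7:
        break               # leave the loop early
    odd_numbers.append(count)
else:
    else_ran_after_break = True   # not reached, because the loop ended with break

print(odd_numbers, else_ran_after_break)

countdown = 3
finished_normally = False

while countdown > 0:
    countdown -= 1
else:
    finished_normally = True      # reached, because the condition became false

print(finished_normally)
```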

## 5. Data Structures <a class="anchor" id="fifth-bullet"></a>

[TOP ↑](#top)

We've so far mostly been working with fundamental, primitive data types: *int*, *float*, *str*, and *bool*. However, most real-world problems are easier to solve when these simple data types are combined into more complex data structures.

A data structure is a collection of data, such as a list of numbers or a record in a database. Python has four built-in data structures: **tuples**, **lists**, **dictionaries**, and **sets**. Each has its own characteristics and features.

### Tuples <a class="anchor" id="tuples-bullet"></a>

[TOP ↑](#top)

A tuple is used to store multiple items in a single variable. It's a collection which is ordered, indexed, and which permits duplicate values. However, tuples are immutable. These objects cannot be changed or altered, and are often useful for containing data which must be retained in its original form. A tuple is written with parentheses, or round brackets: **( )**.

In [None]:
my_tuple = ("red", "green", "blue")

You can also use the *tuple()* constructor:

In [None]:
new_tuple = tuple(("red", "green", "blue"))
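
The cell below is a small sketch of what immutability means in practice, along with two other tuple features: unpacking, and the trailing comma needed for a single-item tuple.

```python
my_tuple = ("red", "green", "blue")

# Items are read by index, just like a list
first = my_tuple[0]
print(first)

# Trying to change an item raises a TypeError
assignment_failed = False
try:
    my_tuple[0] = "yellow"
except TypeError as error:
    assignment_failed = True
    print(f"Tuples cannot be modified: {error}")

# Unpacking assigns each item to its own variable
r, g, b = my_tuple
print(r, g, b)

# A single-item tuple needs a trailing comma; without it we just get a string
single = ("red",)
not_a_tuple = ("red")
print(type(single), type(not_a_tuple))
```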

### Lists <a class="anchor" id="lists-bullet"></a>

[TOP ↑](#top)

A list is also used to store multiple items in a single variable. It's a collection which is also ordered, indexed, and which permits duplicate values. However, unlike tuples, lists are mutable. A list is written with square brackets: **[ ]**. Each item in the list has an **index**, or position within that list. In Python, we start counting from 0 rather than 1. As a result, Python lists are **zero-indexed**.

In [21]:
my_list = ["red", "green", "blue", "black"]

#### Indexing

Items can be accessed using square bracket notation. Python also allows us to count backwards from the end of the list if we prefer. An index of -1 gives us the last item, -2 gives us the second last item.

In [22]:
print(f"The first colour is: {my_list[0]}")
print(f"The second colour is: {my_list[1]}")
print(f"The third colour is: {my_list[2]}")
print(f"The final colour is: {my_list[-1]}")

The first colour is: red
The second colour is: green
The third colour is: blue
The final colour is: black


#### Slicing

If we want to access multiple items we can use slicing: my_list[a:b] gives us every item from index a (inclusive) to index b (exclusive).

In [32]:
print("We can get the second and third items using [1:3]")
print(my_list[1:3]) # Gives us items 1 and 2

We can get the second and third items using [1:3]
['green', 'blue']


We can get everything up to a specified index by leaving the left-hand-side of the colon blank.

In [28]:
print("We can get the first three items using [:3]")
print(my_list[:3])

We can get the first three items using [:3]
['red', 'green', 'blue']


In [33]:
print("We can get the first two items using [:-2]")
print(my_list[:-2])

We can get the first two items using [:-2]
['red', 'green']


We can get everything starting from some index to the end by leaving the right-hand-side of the colon blank.

In [34]:
print("We can get the final two items using [-2:]")
print(my_list[-2:])

We can get the final two items using [-2:]
['blue', 'black']


We can get all items as follows:

In [19]:
print("We can get all items using [:]")
print(my_list[:])

We can get all items using [:]
['red', 'green', 'blue', 'black']


#### Dynamic Lists

We can create a list using a list *literal*, the square bracket notation. The problem with this method is that we need to know every value at the time we create the list. If we want to create a list *dynamically* we can use the list append() method.

In [38]:
numbers = range(1, 21) # The end value is exclusive, so this gives 1, 2, 3, ..., 19, 20
evens = [] # Create an empty list

for number in numbers:
    if number % 2 == 0:
        evens.append(number)
        
print(evens)

[2, 4, 6, 8, 10, 12, 14, 16, 18, 20]


The append() method inserts an item at the end of the list. Alternatively, you may want to insert an item into a specific position. The insert() method allows you to specify where you want to insert the item. The syntax is:

```python
my_list.insert(index, item)
```

The code above inserts *item* into list *my_list* at position *index*.

Finally, items can be removed using the remove() method. It looks through the list and deletes the first occurrence of that value.

```python
my_list.remove(item)
```
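
A runnable sketch of insert() and remove() is given below, using an illustrative list of colours.

```python
my_list = ["red", "green", "blue"]

# Insert "yellow" at index 1; later items shift one place to the right
my_list.insert(1, "yellow")
print(my_list)

# Add a second "green" so that the list contains a duplicate
my_list.append("green")

# remove() deletes only the first occurrence of the value
my_list.remove("green")
print(my_list)

# Removing a value which isn't in the list raises a ValueError
try:
    my_list.remove("purple")
except ValueError:
    print("purple is not in the list")
```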

#### List Comprehensions

A list comprehension is a short-hand way of defining lists, which is quite similar to a loop. It allows for the construction of a new list based on lists that already exist. The inspiration comes from mathematical set comprehension notation (or set-builder notation), where the set is defined, for example, as {x ∈ ℝ | x > 3}. This is the set of all real numbers x such that x is greater than 3. In Python, we can build a new list in the following way:

```python
new_set = [f(x) for x in old_set]
```

We can also include conditions, such as if statements, inside the comprehension. Let's take a concrete example:

In [39]:
odds = [1, 3, 5]
evens = [x + 1 for x in odds]
print(evens)

[2, 4, 6]


In [53]:
randoms = [31, 6, 9, 23, 54, 11, 1]
doubled_randoms = [x*2 for x in randoms]
print(doubled_randoms)

[62, 12, 18, 46, 108, 22, 2]


In [54]:
# The same thing can be accomplished using a lambda

doubled_randoms_lambda = map(lambda x: x*2, randoms)

# We need to convert the output of map() back to a list to print it
list(doubled_randoms_lambda)

[62, 12, 18, 46, 108, 22, 2]
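
The examples above transform every item. As a sketch of the conditions mentioned earlier, the cell below filters items with a trailing ```if```, and chooses between two values with a conditional expression. It re-uses the randoms list; the other names are illustrative.

```python
randoms = [31, 6, 9, 23, 54, 11, 1]

# A trailing "if" keeps only the items which satisfy the condition
even_randoms = [x for x in randoms if x % 2 == 0]
print(even_randoms)

# Filtering and transforming together: double only the numbers greater than 10
doubled_large = [x * 2 for x in randoms if x > 10]
print(doubled_large)

# A conditional expression before "for" picks a value for every item
labels = ["even" if x % 2 == 0 else "odd" for x in randoms]
print(labels)
```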

#### Sorting

We can use the sort() method to sort a list. In the example below we are using the random module to generate a list of random numbers, which we then sort.

In [45]:
import random

random.randint(0, 30) # This returns a random number between 0 and 30 (inclusive)

random_numbers = [random.randint(0, 30) for x in range(0, 10)] # Generates a list of 10 random numbers between 0 and 30

print("List of random numbers:")
print(random_numbers)

print("Sorted random numbers:")
random_numbers.sort()
print(random_numbers)

List of random numbers:
[10, 29, 17, 17, 30, 19, 24, 21, 7, 15]
Sorted random numbers:
[7, 10, 15, 17, 17, 19, 21, 24, 29, 30]
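
Note that sort() changes the list in place. As a brief sketch of the alternative, the built-in sorted() function returns a new sorted list and leaves the original alone. A fixed list is used here instead of random numbers so that the results are repeatable.

```python
original_order = [10, 29, 17, 7]

# sorted() returns a new list; the original is unchanged
ascending = sorted(original_order)
descending = sorted(original_order, reverse=True)
print(original_order, ascending, descending)

# sort() rearranges the list itself and returns None
return_value = original_order.sort()
print(original_order, return_value)
```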


### Dictionaries <a class="anchor" id="dicts-bullet"></a>

[TOP ↑](#top)

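In Python, a dictionary maps each look-up **key** to a corresponding **value**. It's a collection which is mutable and, since Python 3.7, keeps its items in insertion order. It does not permit duplicate keys: assigning to an existing key overwrites its value. A dictionary is written with curly brackets: **{ }**. Keys are often strings, but any hashable type, such as a number or a tuple of strings, can be used, and the values can take any data type.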

The syntax is:

```python
{key1: value1, key2: value2, key3: value3}
```

In [49]:
my_dictionary = {"Simon": "Hall", "Brad": "Pitt", "Tom": "Hardy"}
print(f"People's first and second names: {my_dictionary}")

People's first and second names: {'Simon': 'Hall', 'Brad': 'Pitt', 'Tom': 'Hardy'}


Alternatively, we can use dict().

In [57]:
surnames = dict()
surnames["Simon"] = "Hall"
surnames["Brad"] = "Pitt"
surnames["Tom"] = "Hardy"
print(f"People's first and second names: {surnames}")

People's first and second names: {'Simon': 'Hall', 'Brad': 'Pitt', 'Tom': 'Hardy'}


#### Retrieving Values

We can retrieve items from a dictionary just like we would from a list, using square brackets and passing in the key.

In [58]:
print(surnames["Brad"])

Pitt


#### Adding Items

Adding items to a dictionary is very similar to retrieving them. Again, we use square bracket notation.

In [59]:
surnames["Matt"] = "Damon"
print(surnames)

{'Simon': 'Hall', 'Brad': 'Pitt', 'Tom': 'Hardy', 'Matt': 'Damon'}


#### Deleting Items

We can delete an item using the **del** keyword.

In [60]:
del surnames["Brad"]
print(surnames)

{'Simon': 'Hall', 'Tom': 'Hardy', 'Matt': 'Damon'}


#### Keys & Values

If we want to extract the keys and values from a dictionary we can use .keys() and .values(). We can also use .items() to get both.

In [61]:
all_keys = surnames.keys()
for key in all_keys:
    print(key)

Simon
Tom
Matt


In [62]:
all_values = surnames.values()
for value in all_values:
    print(value)

Hall
Hardy
Damon


In [64]:
all_items = surnames.items()
for key, value in all_items:
    print(f"{key}, {value}")

Simon, Hall
Tom, Hardy
Matt, Damon


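We can also use the .get() method, which returns the value for a key, or a default value if the key is not in the dictionary (```None``` when no default is supplied). The syntax is: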

```python
dictionary.get(keyname, value)
```
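
A short sketch of .get() compared with square bracket access, using the surnames dictionary as it stands after "Brad" was deleted:

```python
surnames = {"Simon": "Hall", "Tom": "Hardy", "Matt": "Damon"}

# The key exists, so its value is returned
known = surnames.get("Tom")

# The key is missing and no default is given, so None is returned
missing = surnames.get("Brad")

# The key is missing, so the supplied default is returned
fallback = surnames.get("Brad", "Unknown")

print(known, missing, fallback)

# By contrast, square brackets raise a KeyError for a missing key
try:
    surnames["Brad"]
except KeyError:
    print("Brad is not in the dictionary")
```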

### Sets <a class="anchor" id="sets-bullet"></a>

[TOP ↑](#top)

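A set is a collection which is unordered and unindexed. The items themselves must be immutable (hashable), but the set is mutable: items can be added or removed. No duplicate items are permitted. A set is written with curly brackets, **{ }**, like a dictionary, but does not have keys. Note that empty curly brackets create an empty dictionary, so an empty set is created with set().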

In [75]:
my_set = {"apple", "banana", "cherry"}
print(my_set)

{'banana', 'apple', 'cherry'}


We can also use the set() constructor.

In [76]:
fruits = set(["apple", "banana", "cherry"])
print(fruits)

{'banana', 'apple', 'cherry'}


Items can be added with the add() method and removed with the remove() or discard() methods.
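
The cell below sketches these methods on an illustrative set of fruits, including the difference between remove() and discard() when the item is missing.

```python
fruits = {"apple", "banana", "cherry"}

fruits.add("mango")
fruits.add("apple")       # already present, so the set is unchanged

fruits.remove("banana")   # raises a KeyError if the item is missing
fruits.discard("kiwi")    # does nothing if the item is missing

try:
    fruits.remove("kiwi")
except KeyError:
    print("kiwi is not in the set")

# Sets are unordered, so sort before printing to get a predictable order
print(sorted(fruits))

# Empty curly brackets create a dictionary; use set() for an empty set
empty_set = set()
empty_dict = {}
print(type(empty_set), type(empty_dict))
```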

#### Searching

We can test whether a set contains an item using the **in** keyword:

In [77]:
print(f"Does our set contain the fruit 'banana'? {'banana' in fruits}")

Does our set contain the fruit 'banana'? True


#### Operations

The set operations *.union()*, *.intersection()*, *.difference()*, and *.symmetric_difference()* can be used to find combinations of sets.

In [79]:
threes = set(range(3, 31, 3))
print(f"Multiples of three: {threes}")

fives = set(range(5, 31, 5))
print(f"Multiples of five: {fives}")

print(f"Multiples of either 3 or 5: {threes.union(fives)}")

Multiples of three: {3, 6, 9, 12, 15, 18, 21, 24, 27, 30}
Multiples of five: {5, 10, 15, 20, 25, 30}
Multiples of either 3 or 5: {3, 5, 6, 9, 10, 12, 15, 18, 20, 21, 24, 25, 27, 30}
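
The remaining operations can be sketched with the same two sets. The result names are illustrative, and sorted() is used only to print the items in a predictable order.

```python
threes = set(range(3, 31, 3))
fives = set(range(5, 31, 5))

# Items in both sets
both = threes.intersection(fives)
print(f"Multiples of both 3 and 5: {sorted(both)}")

# Items in the first set but not the second
only_threes = threes.difference(fives)
print(f"Multiples of 3 but not 5: {sorted(only_threes)}")

# Items in exactly one of the two sets
exactly_one = threes.symmetric_difference(fives)
print(f"Multiples of 3 or 5 but not both: {sorted(exactly_one)}")

# The operators &, - and ^ are equivalent shorthand
assert both == threes & fives
assert only_threes == threes - fives
assert exactly_one == threes ^ fives
```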


## 6. NumPy and Pandas <a class="anchor" id="sixth-bullet"></a>

[TOP ↑](#top)

Python lists are very versatile, and list items can be anything from floats to strings and even lists themselves.  The NumPy package gives us efficient R-style vectors, called arrays, containing elements of the same type which are optimised for batch processing. NumPy is very useful for data processing in general. NumPy vectors are used as columns in the Pandas package to generate dataframes, similar to R. A Pandas dataframe consists of a collection of columns, where each column is a NumPy vector containing values.

### NumPy <a class="anchor" id="numpy-bullet"></a>

[TOP ↑](#top)

A NumPy array is similar to a list. However, there are some key differences. Arrays have items all of the same type, and they add and multiply element-wise, exactly like vectors but unlike lists.

In [13]:
import numpy as np

l1 = [1,2,3]
l2 = [4,5,6]

a1 = np.array(l1)
a2 = np.array(l2)

print("List Addition:" + str(l1+l2))
print("Array Addition:" + str(a1+a2))
print("Array Multiplication:" + str(a1*a2))

List Addition:[1, 2, 3, 4, 5, 6]
Array Addition:[5 7 9]
Array Multiplication:[ 4 10 18]


**Generating Sequences**

In [16]:
# The linspace() function is inclusive, and produces floats
gen = np.linspace(10,100,10)
print(gen)

[ 10.  20.  30.  40.  50.  60.  70.  80.  90. 100.]


In [17]:
# The arange() function is exclusive, and produces ints
gen2 = np.arange(10,110,10)
print(gen2)

[ 10  20  30  40  50  60  70  80  90 100]


In [19]:
type(gen2[0])

numpy.int64

**Array Shape**

Arrays can be 1-dimensional lists with a single index. They can also be 2-dimensional arrays, commonly known as matrices. Matrices consist of rows and columns, so they require two indices to pinpoint a value, one for the row and one for the column, (r, c). Dataframes are similar to 2-D arrays, but not exactly the same. Matrices are arrays of pure numbers, for one thing, whereas dataframes can be of mixed data type. You can check a NumPy array's dimensionality using the shape property.

In [20]:
oneDValues = [1, 2, 3, 4, 5, 6, 7, 8, 9]
twoDValues = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

a3 = np.array(oneDValues)
a4 = np.array(twoDValues)
print("a3 has shape " + str(a3.shape))
print("a4 has shape " + str(a4.shape))

print("2D Array: " + str(twoDValues))
print("Element at (0,0) is " + str(a4[0, 0]))
print("Element at (1,1) is " + str(a4[1, 1]))
print("Element at (2,1) is " + str(a4[2, 1]))

a3 has shape (9,)
a4 has shape (3, 3)
2D Array: [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
Element at (0,0) is 1
Element at (1,1) is 5
Element at (2,1) is 8


NumPy arrays can be sliced using a similar syntax to regular list slicing. Remember that lists were sliced using [start_index:end_index]. We can also provide an additional step parameter to allow us to choose the step by which we increase the index [start_index:end_index:step].

In [21]:
# Take every second item from the oneDValues list (the same syntax works on the array a3)
print(oneDValues[0:9:2])

[1, 3, 5, 7, 9]


In [22]:
# second row, third column of a4
print(a4[1,2])

# all rows, first column of a4
print(a4[0:3,0])

# all rows, last column of a4
print(a4[0:3,2])

6
[1 4 7]
[3 6 9]


### Pandas <a class="anchor" id="pandas-bullet"></a>

[TOP ↑](#top)

Pandas, whose name is derived from "panel data", is a Python library for working with dataframes. It's quite similar to the R implementation of dataframes. Pandas represents a table as a dataframe in which each column is a Pandas Series, typically backed by a NumPy array, and selecting a single row also returns a Series.

**Data Types**

Pandas has several different data types, including integers (```int8```, ```int16```, ```int32```, ```int64```, all of which can be negative or positive, plus ```uint8```, ```uint16```, ```uint32```, and ```uint64``` which are "unsigned" and cannot be negative), floats (such as ```float64```), bools (```bool```), date/time (```datetime64```), date/time differences (```timedelta```), finite categorical data (```category```), and object type (```object```) which is used to store strings and combinations of numeric and non-numeric data. Finally, in addition to ```None```, which is built into Python, there is also ```NaN``` (Not-a-Number), which is used to indicate missing values in both NumPy and Pandas. ```NaN``` is an IEEE 754 floating-point value which represents missing values in numerical data. In NumPy and Pandas, an object's ```dtype``` specifies the data type and the size of the elements inside a NumPy array or a Pandas column.
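
The difference between ```None``` and ```NaN``` can be sketched using only the standard library, since ```NaN``` is an ordinary Python float. NumPy and Pandas provide their own helpers for detecting missing values, which are not shown here.

```python
import math

nan = float("nan")

# NaN is a float, whereas None has its own type
print(type(nan), type(None))

# NaN is not equal to anything, including itself
nan_equals_itself = (nan == nan)
print(nan_equals_itself)

# So we test for it with math.isnan() rather than ==
is_nan = math.isnan(nan)
print(is_nan)
```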

#### Pandas Series

The Series is the basic one-dimensional data structure in Pandas. A Series is an array-like data structure where each element is assigned a label, known collectively as the index, and where each element can be accessed either by name (its label) or by numbered position.

In order to create a Series we need to supply both the series values and the labels. We can keep our labels and values in Python lists and pass them as parameters to the pd.Series() constructor, or we can use a Python dictionary, which can be converted directly into a series.

In [23]:
import pandas as pd

labels = ['a','b','c']
values = np.array([10,20,30])

s1 = pd.Series(data=values, index=labels)
print(s1)

a    10
b    20
c    30
dtype: int64


In [24]:
rowDict = dict({'a': 10, 'b': 20, 'c': 30})
s2 = pd.Series(rowDict)
print(s2)

a    10
b    20
c    30
dtype: int64


#### Accessing Values

We have three ways of accessing values in a series. 

1. Using the label name with dot notation (like the dollar ($) sign in R).
2. Using the label names with square bracket notation.
3. Using the positional index with square bracket notation.

In [2]:
#1 dot notation
print("The label 'a' has value: " + str(s1.a))

#2 square bracket name notation
print("The label 'a' has value: " + str(s1['a']))

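#3 square bracket positional index (.iloc makes it explicit that 0 is a position, not a label)
print("The label 'a' has value: " + str(s1.iloc[0]))

The label 'a' has value: 10
The label 'a' has value: 10
The label 'a' has value: 10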

#### DataFrames

A Pandas DataFrame is a two-dimensional table-like data structure consisting of rows and columns, where each column can contain data of a different type (e.g., numeric, string, or boolean), and each row represents a unique observation or record. A DataFrame is made up of a collection of Pandas Series objects, where each Series represents a column of data in the DataFrame.

The easiest way to create a Pandas DataFrame is to use a list of Python dictionaries. We've seen above that we can easily create a row (or Series) from a Python dictionary. The dictionary key is taken to represent the column name while the value represents the value for that row. By creating a list of dictionaries we can gather together our initial data for creating a dataframe.

In [27]:
# Create a dictionary
mutt = dict({"name": "Mutt", "breed": "Terrier", "height": 1.2})

# Create another dictionary in an alternative manner
bonnie = {}
bonnie["name"] = "Bonnie"
bonnie["breed"] = "Bichon"
bonnie["height"] = 0.6

# Append these dictionaries to a list, and convert to DataFrame
dogs = list()
dogs.append(mutt)
dogs.append(bonnie)
pd.DataFrame(dogs)

# We could shorten the above code to the following
dogs = [{"name": "Mutt", "breed": "terrier", "height": 1.2}, {"name": "Bonnie", "breed": "bichon", "height": 0.6}]
pd.DataFrame(dogs)

Unnamed: 0,name,breed,height
0,Mutt,terrier,1.2
1,Bonnie,bichon,0.6


We can also create a dataframe directly from a CSV file, using pd.read_csv().

In [28]:
df = pd.read_csv("iris.csv")
df

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width,variety
0,5.1,3.5,1.4,0.2,Setosa
1,4.9,3.0,1.4,0.2,Setosa
2,4.7,3.2,1.3,0.2,Setosa
3,4.6,3.1,1.5,0.2,Setosa
4,5.0,3.6,1.4,0.2,Setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,Virginica
146,6.3,2.5,5.0,1.9,Virginica
147,6.5,3.0,5.2,2.0,Virginica
148,6.2,3.4,5.4,2.3,Virginica


We can also copy a DataFrame using copy(), for example df2 = df.copy().
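
The difference between copying and plain assignment can be sketched as follows. The DataFrame and the variable names `original`, `alias` and `duplicate` are made up for illustration; they are not part of the notebook's data.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration
original = pd.DataFrame({"name": ["Mutt", "Bonnie"], "height": [1.2, 0.6]})

alias = original             # a second name for the same object
duplicate = original.copy()  # an independent copy of the data

# Changing the original is visible through the alias but not the copy
original.loc[0, "height"] = 1.5

print(alias.loc[0, "height"])      # 1.5
print(duplicate.loc[0, "height"])  # 1.2
```

Working on a copy is a reasonable habit whenever we want to experiment without altering the data we loaded.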

#### Exploration

In [31]:
df.shape

(150, 5)

In [29]:
df.head()

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width,variety
0,5.1,3.5,1.4,0.2,Setosa
1,4.9,3.0,1.4,0.2,Setosa
2,4.7,3.2,1.3,0.2,Setosa
3,4.6,3.1,1.5,0.2,Setosa
4,5.0,3.6,1.4,0.2,Setosa


In [30]:
df.tail()

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width,variety
145,6.7,3.0,5.2,2.3,Virginica
146,6.3,2.5,5.0,1.9,Virginica
147,6.5,3.0,5.2,2.0,Virginica
148,6.2,3.4,5.4,2.3,Virginica
149,5.9,3.0,5.1,1.8,Virginica


#### Accessing Values

In [34]:
df.variety

0         Setosa
1         Setosa
2         Setosa
3         Setosa
4         Setosa
         ...    
145    Virginica
146    Virginica
147    Virginica
148    Virginica
149    Virginica
Name: variety, Length: 150, dtype: object

In [35]:
df["variety"]

0         Setosa
1         Setosa
2         Setosa
3         Setosa
4         Setosa
         ...    
145    Virginica
146    Virginica
147    Virginica
148    Virginica
149    Virginica
Name: variety, Length: 150, dtype: object

In [37]:
df[["variety", "sepal.length"]]

Unnamed: 0,variety,sepal.length
0,Setosa,5.1
1,Setosa,4.9
2,Setosa,4.7
3,Setosa,4.6
4,Setosa,5.0
...,...,...
145,Virginica,6.7
146,Virginica,6.3
147,Virginica,6.5
148,Virginica,6.2


#### Indexing and Slicing

We can select rows from a pandas DataFrame using the .loc and .iloc indexers. The .loc indexer allows us to select rows and columns by label (name), whereas .iloc allows us to select rows and columns by integer position. Rows usually have a numeric index, although it is possible to give rows a named index. If the index is numeric (the default), the row labels are themselves integers, so we can pass numbers to .loc.

In [38]:
# Get the first five rows with .loc, selecting the columns we want by name
# Note that a .loc slice is inclusive: 0:4 returns the rows labelled 0 to 4
df.loc[0:4, ["variety", "sepal.length", "sepal.width"]]

Unnamed: 0,variety,sepal.length,sepal.width
0,Setosa,5.1,3.5
1,Setosa,4.9,3.0
2,Setosa,4.7,3.2
3,Setosa,4.6,3.1
4,Setosa,5.0,3.6


In [39]:
# Get the first five rows with .iloc, selecting the columns we want by position
# Note that an .iloc slice is exclusive: 0:5 returns positions 0 to 4
df.iloc[0:5, [4,0,1]]

Unnamed: 0,variety,sepal.length,sepal.width
0,Setosa,5.1,3.5
1,Setosa,4.9,3.0
2,Setosa,4.7,3.2
3,Setosa,4.6,3.1
4,Setosa,5.0,3.6
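
The distinction between labels and positions becomes clearer when the index is not numeric. The sketch below assumes a small made-up DataFrame with a named index; the data and variable names are hypothetical.

```python
import pandas as pd

# Hypothetical data with a named (string) index instead of the default 0, 1, 2, ...
dogs = pd.DataFrame(
    {"breed": ["terrier", "bichon", "collie"], "height": [1.2, 0.6, 1.0]},
    index=["Mutt", "Bonnie", "Rex"],
)

# .loc selects by label, and a label slice includes both end points
by_label = dogs.loc["Mutt":"Bonnie", ["height"]]

# .iloc selects by integer position, and the end position is excluded
by_position = dogs.iloc[0:2, [1]]

print(by_label)
print(by_position.equals(by_label))  # True
```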


We can also use Boolean logic to access information. This is known as logical indexing. In practice it is much more common to want all rows meeting a certain condition (e.g. all rows where a column has a particular value) than rows at fixed positions. The .loc indexer also accepts an array of True/False values, and returns only the rows corresponding to True in the array.

In [40]:
# Use logical indexing to select the first and third rows only
students = pd.DataFrame([['d19122334', 'John', 'Smith'], ['d19155667', 'Jane', 'Doe'], ['c18155334', 'Enda', 'Smith']], columns=["StudentNo", "FirstName", "LastName"])
print(students.loc[[True, False, True]])

   StudentNo FirstName LastName
0  d19122334      John    Smith
2  c18155334      Enda    Smith


We can use this to create complex queries to retrieve certain rows in our dataset.

In [42]:
isSmith = students["LastName"] == "Smith"

# Gives us a Boolean Series: True, False, True
print(isSmith)

print(students.loc[isSmith])

# Putting all of this together in one line we get
students.loc[students["LastName"] == "Smith"]

0     True
1    False
2     True
Name: LastName, dtype: bool
   StudentNo FirstName LastName
0  d19122334      John    Smith
2  c18155334      Enda    Smith


Unnamed: 0,StudentNo,FirstName,LastName
0,d19122334,John,Smith
2,c18155334,Enda,Smith


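We can build up complex queries by combining conditions with the element-wise operators & (and), | (or) and ~ (not), rather than Python's and, or and not keywords. Because these operators bind more tightly than comparisons such as ==, each condition must be wrapped in its own parentheses.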

In [43]:
students.loc[(students["LastName"]=="Doe")|(students["FirstName"]=="John")]

Unnamed: 0,StudentNo,FirstName,LastName
0,d19122334,John,Smith
1,d19155667,Jane,Doe


In [48]:
# Get all rows in the iris dataset with sepal length less than 6.0
df.loc[df["sepal.length"]<6.0]

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width,variety
0,5.1,3.5,1.4,0.2,Setosa
1,4.9,3.0,1.4,0.2,Setosa
2,4.7,3.2,1.3,0.2,Setosa
3,4.6,3.1,1.5,0.2,Setosa
4,5.0,3.6,1.4,0.2,Setosa
...,...,...,...,...,...
113,5.7,2.5,5.0,2.0,Virginica
114,5.8,2.8,5.1,2.4,Virginica
121,5.6,2.8,4.9,2.0,Virginica
142,5.8,2.7,5.1,1.9,Virginica


#### Dropping Rows & Columns

We can drop rows and columns using the drop() method. If we're dropping a column we need to set the axis parameter to tell Pandas we're dropping a column and not a row. In general, axis 0 means rows and axis 1 means columns.

In [49]:
# drop() returns a new DataFrame, so df itself is left unchanged
df2 = df.drop("variety", axis=1)
df2

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2
...,...,...,...,...
145,6.7,3.0,5.2,2.3
146,6.3,2.5,5.0,1.9
147,6.5,3.0,5.2,2.0
148,6.2,3.4,5.4,2.3


#### Summary Statistics

We can find summary statistics using min(), max(), mean(), median(), mode(), std(), sum(). We can also call NumPy functions, such as percentile.

In [52]:
df2.mean()

sepal.length    5.843333
sepal.width     3.057333
petal.length    3.758000
petal.width     1.199333
dtype: float64

In [50]:
np.percentile(df["sepal.length"], 75)

6.4
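
As a small sketch of the other statistics listed above, the following uses a made-up column of four values so that the results are easy to check by hand. The describe() method, which is not mentioned above, gathers several of these statistics into one table.

```python
import pandas as pd

# Hypothetical sample data
scores = pd.DataFrame({"score": [1, 2, 2, 5]})

print(scores["score"].min())      # 1
print(scores["score"].max())      # 5
print(scores["score"].sum())      # 10
print(scores["score"].mean())     # 2.5
print(scores["score"].median())   # 2.0
print(scores["score"].mode()[0])  # 2 (mode() returns a Series, as there may be ties)

# describe() reports count, mean, std, min, quartiles and max together
summary = scores.describe()
print(summary)
```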

**Dealing with NaNs**

We need to watch out for missing values, since NaNs in the dataset can cause issues. There are several ways to handle them:

* **Drop them.** The **dropna()** method removes rows or columns containing NaN values. By default it drops any row containing at least one NaN value; you can also specify the axis and a threshold. However, dropping NaNs can lose valuable information and may not always be the best strategy.
* **Replace them.** The **fillna()** method replaces NaN values with a specified value; for example, **df.fillna(0)** replaces all NaN values with 0. Forward or backward filling (**ffill()** and **bfill()**) instead replaces a missing value with the value from the preceding or following row or column.
* **Interpolate them.** The **interpolate()** method fills in missing values based on neighboring data points, which can be useful with time-series data.
* **Keep them.** We can simply keep the NaN values and work around them, using **isna()** and **notna()** to check for missing values.
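
These options can be compared side by side on a small table. The sketch below uses made-up readings with two missing values; the column names and variables are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one NaN in each column
readings = pd.DataFrame(
    {"temp": [20.0, np.nan, 24.0], "humidity": [0.5, 0.6, np.nan]}
)

# Count the missing values in each column
print(readings.isna().sum())

dropped = readings.dropna()            # keep only rows with no NaN
zero_filled = readings.fillna(0)       # replace every NaN with 0
forward = readings.ffill()             # copy the previous row's value down
interpolated = readings.interpolate()  # linear interpolation between neighbours

print(len(dropped))                  # 1
print(forward.loc[1, "temp"])        # 20.0
print(interpolated.loc[1, "temp"])   # 22.0
```

Which strategy is appropriate depends on the data: filling with 0 is only sensible when 0 is a meaningful value for that column.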

## 7. Classes <a class="anchor" id="seventh-bullet"></a>

[TOP ↑](#top)

Python is an object-oriented programming language. Therefore, almost everything in Python is an object, with its associated properties and methods. A **class** is an object constructor, or a "blueprint" for creating new objects.

To define a new class, we use the class keyword, with the following syntax:

```python
class NameOfClass:
    body
```

All classes have a special function, `__init__()`, which is executed when a new object is created from the class. It's used to assign values to object properties, or to perform other operations that are necessary when the object is being created. We can, for example, create a class named Company, to which we assign values for its name and year of establishment:

```python
class Company:
  def __init__(self, name, established):
    self.name = name
    self.established = established

company1 = Company("KPMG", 1987)

print(company1.name)
print(company1.established)
```

The `__init__()` function is called automatically every time the class is used to create a new object. The `self` parameter is a reference to the current instance of the class, and is used to access variables that belong to that instance.

Objects in Python can also contain methods, which are functions that belong to the object.

```python
class Company:
  def __init__(self, name, established):
    self.name = name
    self.established = established
    
  def myfunc(self):
    print("The company name is " + self.name)

company1 = Company("KPMG", 1987)

company1.myfunc()
```
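
Methods can also take arguments and return values, and each object keeps its own properties. The sketch below extends the Company class with a hypothetical `age()` method; the second company and the year passed in are made up for illustration.

```python
class Company:
    def __init__(self, name, established):
        self.name = name
        self.established = established

    def age(self, current_year):
        """Return the number of years since the company was established."""
        return current_year - self.established


# Two objects created from the same class hold independent values
company1 = Company("KPMG", 1987)
company2 = Company("Example Ltd", 2010)  # hypothetical second company

print(company1.age(2024))  # 37
print(company2.age(2024))  # 14
```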

## 8. Web Scraping <a class="anchor" id="eighth-bullet"></a>

[TOP ↑](#top)

We want to be able to scrape information from the World Wide Web. The examples in this section follow the freeCodeCamp tutorial at:

https://www.freecodecamp.org/news/web-scraping-python-tutorial-how-to-scrape-data-from-a-website/

In [55]:
import requests

# Make a request to the demo page and store the result in the 'res' variable
res = requests.get(
    'https://codedamn-classrooms.github.io/webscraper-python-codedamn-classroom-website/')
txt = res.text
status = res.status_code

# Print the page source and the HTTP status code
print(txt, status)

<!DOCTYPE html>
<html lang="en">
	<head>
		<!-- Anti-flicker snippet (recommended)  -->
		<style>
			.async-hide {
				opacity: 0 !important;
			}
		</style>
		<title>codedamn Web Scraper demo</title>
		<meta charset="utf-8" />
		<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1" />

		<meta
			name="keywords"
			content="web scraping,Web Scraper,Chrome extension,Crawling,Cross platform scraper, "
		/>
		<meta name="description" content="The most popular web scraping website." />
		<link
			rel="icon"
			sizes="128x128"
			href="/webscraper-python-codedamn-classroom-website/favicon.png"
		/>

		<meta name="viewport" content="width=device-width, initial-scale=1.0" />

		<link rel="stylesheet" href="/webscraper-python-codedamn-classroom-website/app.css" />

		<link
			rel="apple-touch-icon"
			href="/webscraper-python-codedamn-classroom-website/logo-icon.png"
		/>

		<script defer src="/webscraper-python-codedamn-classroom-website/app.js"></script>
	</head>
	<body>
		<header r

In [57]:
import requests
from bs4 import BeautifulSoup

# Make a request to https://codedamn-classrooms.github.io/webscraper-python-codedamn-classroom-website/
page = requests.get(
    "https://codedamn-classrooms.github.io/webscraper-python-codedamn-classroom-website/")
soup = BeautifulSoup(page.content, 'html.parser')

# Extract title of page
page_title = soup.title.text

# print the result
print(page_title)

codedamn Web Scraper demo


We saw how we can extract the title from the page. It is equally easy to extract other sections too.

You also saw that you have to call .text on these to get the string, but you can print them without calling .text too, and it will give you the full markup. Try to run the example below:

In [58]:
import requests
from bs4 import BeautifulSoup

# Make a request
page = requests.get(
    "https://codedamn.com")
soup = BeautifulSoup(page.content, 'html.parser')

# Extract title of page
page_title = soup.title.text

# Extract body of page
page_body = soup.body

# Extract head of page
page_head = soup.head

# print the result
print(page_body, page_head)

<body class="font-body"><div data-reactroot="" id="__next"><div id="root"><div class="relative" id="layout-conntainer"><header class="z-[51] relative cd-morph-dropdown text-gray-100 bg-gradient-to-r from-gray-900 via-gray-800 to-gray-900"><div class="jsx-b35e882c88056c79 relative py-4 max-w-7xl mx-auto flex items-center justify-between px-4 sm:px-6 group"><a class="jsx-b35e882c88056c79 flex flex-grow lg:flex-grow-0 sm:space-x-2 items-center" data-testid="logo" href="/"><div class="jsx-b35e882c88056c79"><span style="box-sizing:border-box;display:inline-block;overflow:hidden;width:initial;height:initial;background:none;opacity:1;border:0;margin:0;padding:0;position:relative;max-width:100%"><span style="box-sizing:border-box;display:block;width:initial;height:initial;background:none;opacity:1;border:0;margin:0;padding:0;max-width:100%"><img alt="" aria-hidden="true" src="data:image/svg+xml,%3csvg%20xmlns=%27http://www.w3.org/2000/svg%27%20version=%271.1%27%20width=%2735.13%27%20height=%27

Now that you have explored some parts of BeautifulSoup, let's look at how you can select DOM elements with BeautifulSoup methods.

Once you have the soup variable, you can call .select on it, which takes a CSS selector. That is, you can reach down the DOM tree just as you would select elements with CSS. For example, soup.select('h1')[0].text gives the text of the first h1 element on the page.

.select returns a Python list of all the matching elements, which is why the [0] index is needed to pick out only the first one.
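
The behaviour of .select can be tried without a network request by parsing a small HTML string. The HTML below is made up for illustration and only loosely mimics the demo page. The sketch also uses .select_one, which returns the first match directly.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, used in place of a live page
html = """
<html><body>
  <h1>Test Sites</h1>
  <h1>E-commerce training site</h1>
  <p class="description">First paragraph</p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# .select returns a list of every matching element
all_h1 = [element.text for element in soup.select("h1")]

# Index [0] picks the first match from that list
first_h1 = soup.select("h1")[0].text

# .select_one returns the first match, or None if nothing matches
missing = soup.select_one("h2")

print(all_h1)    # ['Test Sites', 'E-commerce training site']
print(first_h1)  # Test Sites
print(missing)   # None
```

Checking for None (or an empty list) before indexing avoids an error when a page does not contain the element we expect.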

In [59]:
import requests
from bs4 import BeautifulSoup
# Make a request
page = requests.get(
    "https://codedamn-classrooms.github.io/webscraper-python-codedamn-classroom-website/")
soup = BeautifulSoup(page.content, 'html.parser')

# Create all_h1_tags as empty list
all_h1_tags = []

# Set all_h1_tags to all h1 tags of the soup
for element in soup.select('h1'):
    all_h1_tags.append(element.text)

# Create seventh_p_text and set it to 7th p element text of the page
seventh_p_text = soup.select('p')[6].text

print(all_h1_tags, seventh_p_text)

['Test Sites', 'E-commerce training site'] 7 reviews


Let's go ahead and extract the top items scraped from the URL: https://codedamn-classrooms.github.io/webscraper-python-codedamn-classroom-website/

If you open this page in a new tab, you’ll see some top items. The task is to scrape their names and review labels and store them in a list called top_items.

To do this:

* Use .select to extract the titles. (Hint: one selector for product titles could be a.title)
* Use .select to extract the review count label for those product titles. (Hint: one selector for reviews could be div.ratings) Note: this is a complete label (i.e. 2 reviews) and not just a number.
* Create a new dictionary in the format:

```python
info = {
   "title": 'Asus AsusPro Adv...   '.strip(),
   "review": '2 reviews\n\n\n'.strip()
}
```

Note that the strip method is used to remove any extra newlines or whitespace you might have in the output.

* Append this dictionary to a list called top_items.
* Print this list at the end.

In [60]:
import requests
from bs4 import BeautifulSoup
# Make a request
page = requests.get(
    "https://codedamn-classrooms.github.io/webscraper-python-codedamn-classroom-website/")
soup = BeautifulSoup(page.content, 'html.parser')

# Create top_items as empty list
top_items = []

# Extract the title and review label of each product and store them in top_items
products = soup.select('div.thumbnail')
for elem in products:
    title = elem.select('h4 > a.title')[0].text
    review_label = elem.select('div.ratings')[0].text
    info = {
        "title": title.strip(),
        "review": review_label.strip()
    }
    top_items.append(info)

print(top_items)

[{'title': 'Asus AsusPro Adv...', 'review': '7 reviews'}, {'title': 'Asus ROG Strix G...', 'review': '4 reviews'}, {'title': 'Acer Aspire 3 A3...', 'review': '2 reviews'}]


Note that this is only one possible solution; you can attempt this in a different way too. In this solution:

1. First you select all the div.thumbnail elements, which gives you a list of individual products.
2. Then you iterate over them.
3. Because select can be chained, you call select again on each product to get the title.
4. Because you are already inside a loop over div.thumbnail, the h4 > a.title selector gives only one result, inside a list. You select that list's 0th element and extract the text.
5. Finally you strip any extra whitespace and append the result to your list.

In [61]:
import requests
from bs4 import BeautifulSoup
# Make a request
page = requests.get(
    "https://codedamn-classrooms.github.io/webscraper-python-codedamn-classroom-website/")
soup = BeautifulSoup(page.content, 'html.parser')

# Create all_links as empty list
all_links = []

# Extract the text and href of every link and store them in all_links
links = soup.select('a')
for ahref in links:
    text = ahref.text
    text = text.strip() if text is not None else ''

    href = ahref.get('href')
    href = href.strip() if href is not None else ''
    all_links.append({"href": href, "text": text})

print(all_links)

[{'href': '', 'text': 'Toggle navigation'}, {'href': '/webscraper-python-codedamn-classroom-website/', 'text': ''}, {'href': '#page-top', 'text': ''}, {'href': '/webscraper-python-codedamn-classroom-website/', 'text': 'Web Scraper'}, {'href': '/webscraper-python-codedamn-classroom-website/cloud-scraper', 'text': 'Cloud Scraper'}, {'href': '/webscraper-python-codedamn-classroom-website/pricing', 'text': 'Pricing'}, {'href': '#section3', 'text': 'Learn'}, {'href': '/webscraper-python-codedamn-classroom-website/documentation', 'text': 'Documentation'}, {'href': '/webscraper-python-codedamn-classroom-website/tutorials', 'text': 'Video Tutorials'}, {'href': '/webscraper-python-codedamn-classroom-website/how-to-videos', 'text': 'How to'}, {'href': '/webscraper-python-codedamn-classroom-website/test-sites', 'text': 'Test Sites'}, {'href': 'https://forum.webscraper.io/', 'text': 'Forum'}, {'href': 'https://chrome.google.com/webstore/detail/web-scraper/jnhgnonknehpejjnehehllkliplmbmhn?hl=en', '

Finally, let's look at how you can generate a CSV file from a set of data. You will create a CSV with the following headings:

* Product Name
* Price
* Description
* Reviews
* Product Image

These products are located in the div.thumbnail elements. The code below extracts them and writes the CSV:

In [62]:
import requests
from bs4 import BeautifulSoup
import csv
# Make a request
page = requests.get(
    "https://codedamn-classrooms.github.io/webscraper-python-codedamn-classroom-website/")
soup = BeautifulSoup(page.content, 'html.parser')

# Create all_products as empty list
all_products = []

# Extract the details of each product and store them in all_products
products = soup.select('div.thumbnail')
for product in products:
    name = product.select('h4 > a')[0].text.strip()
    description = product.select('p.description')[0].text.strip()
    price = product.select('h4.price')[0].text.strip()
    reviews = product.select('div.ratings')[0].text.strip()
    image = product.select('img')[0].get('src')

    all_products.append({
        "name": name,
        "description": description,
        "price": price,
        "reviews": reviews,
        "image": image
    })


keys = all_products[0].keys()

with open('products.csv', 'w', newline='') as output_file:
    dict_writer = csv.DictWriter(output_file, keys)
    dict_writer.writeheader()
    dict_writer.writerows(all_products)
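
To check that the CSV was written as expected, it can be read back with csv.DictReader. The sketch below is self-contained: it writes made-up rows to an in-memory buffer instead of products.csv, so the row contents and variable names are assumptions rather than scraped data.

```python
import csv
import io

# Hypothetical rows in the same shape as all_products
rows = [
    {"name": "Laptop A", "price": "$100", "reviews": "7 reviews"},
    {"name": "Laptop B", "price": "$200", "reviews": "4 reviews"},
]

# Write to an in-memory text buffer instead of a file on disk
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
writer.writeheader()
writer.writerows(rows)

# Rewind and read the text back; each row is returned as a dictionary of strings
buffer.seek(0)
read_back = [dict(row) for row in csv.DictReader(buffer)]

print(read_back == rows)  # True
```

Note that everything read from a CSV is a string, so numeric columns such as a price would need converting after reading.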