#**Coding for AI - Workshop 1**

<p>This notebook is divided into two parts:

**Module 1** covers the basics of working with Jupyter Notebooks and the language Python. The goal is to get the student familiar with the development environment and run Python code for data processes.

**Module 2** covers data libraries and packages available in Python: how to install them, run data functions and evaluate results. This part introduces some of the most popular and widely used libraries in data science and analytics:

<ul>
    <li>Pandas</li>
    <li>Numpy</li>
    <li>Matplotlib</li>
    <li>Seaborn</li>
</ul>

</p>



---


#**Module 1 - Python and the Jupyter Notebooks environment**

In this workshop, we introduce the Jupyter Notebooks environment using Google Colaboratory, or Colab, a free hosted notebook service.

*“Colab” is a product from Google Research. Colab allows anybody to write and execute arbitrary python code through the browser, and is especially well suited to machine learning, data analysis and education. More technically, Colab is a hosted Jupyter notebook service that requires no setup to use, while providing access free of charge to computing resources.*




<h2>The Python Syntax</h2>


Let's start by running some Python code to get familiar with the environment and practice some basic Python functionality.

In [52]:
x = 5
y = 20
z = 10
amount = 10
print(x)
x = x + amount + y + z
print(x)

5
45


In the next cell, we generate a list of words (chart types) and create a simple text file with it. Open the "Files" view in the left-hand toolbar to see the new file in the folder.
Notice the indentation and how comments are embedded within the code.

In [None]:
#@title
chartList = ["bar","line","pie","flow","gantt"] # here we're declaring a list

# create an external file, read the list created above and load the elements into the text file
with open("charts.txt","w") as charts_file:
    for chart in chartList:  #notice the tabbed indent here, they are very important in Python
        charts_file.write(chart + "\n")  #and we indent again
    print("File updated")


<h3>Getting user input</h3>

Python has a function called input. We can use this function to get keyboard input and save it to a variable.

In [None]:
name = input("Please enter your name: ")
print("Hello " + name + ", welcome to Python in Colaboratory")

#**Section 1: Variables**
<p>Variables are used to store data. In some programming languages you need to define the 'type' of the variable, such as floating point. In Python we do not need to do this. All you need to do is name the variable and assign a value; Python infers the type from the value </p>

In [None]:
test_variable = 1234
type(test_variable)

Variables in Python can have the following types:

* Text Type: str
* Numeric Types:	int, float, complex
* Sequence Types:	list, tuple, range
* Mapping Type:	dict
* Set Types:	set, frozenset
* Boolean Type:	bool
* Binary Types:	bytes, bytearray, memoryview   


Let's look at String variables:

In [None]:
test_string_variable = "Hello"

In [None]:
type(test_string_variable)

In [None]:
test_string_variable

In [None]:
print(test_string_variable)

Let's talk about the difference between the last two cells. In one we use the print function and in the other we do not. The print function explicitly outputs the value of the variable. The cell without print does not ask for any output, but the notebook automatically displays the value of the last expression in a cell
***

Can I use single quotes with double quotes?

Use single-quotes for string literals, e.g. 'my-identifier', but use double-quotes for strings that are likely to contain single-quote characters as part of the string itself (such as error messages, or any strings containing natural language), e.g. "You've got an error!".



In [None]:
var1 = 'This is an example of using single-quotes for string variables'
var2 = "This is an example using double-quotes, let's practice more!!"
print(var1)
print(var2)

# The next case generates an error: a string must start and end with the same type of quote
# fix this problem and re-run this cell.
var3 = 'Try this, mixing single and double quotes"

Working with **comments** is very useful when collaborating and sharing code with other programmers. Use the # sign for single-line comments. For multi-line comments, a string in three double-quotes that is not assigned to a variable is commonly used

In [None]:
# This is an example of a single-line comment

"""
This is an example of
a multi-line
comment
"""

<h4>Concatenating Strings</h4>

<p>We can also concatenate or combine strings together</p>

In [None]:
#Concat
first_name = "John"
last_name = "Smith"
print(first_name + " " + last_name)

<h4>Escaping characters</h4>
<p>You saw that the double quotes surrounding a string are not printed, because they are used to define the string. However, we can still print double quotes if we choose to. </p>

In [None]:
#Escape characters
print("\"If you are going through hell, keep going\", Winston Churchill")

The \ backslash character allows us to escape the double quote. This also can be done for other things such as tabs. This escape character allows us to be able to print out information in the correct way we want.
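
As a short illustration of the other escape sequences mentioned above, the sketch below shows a tab, a new line, a literal backslash and an escaped single quote. The variable names and the example path are made up for this illustration.

```python
# Common escape sequences inside a string literal
tabbed = "Name:\tJohn"          # \t inserts a tab
two_lines = "Line 1\nLine 2"    # \n starts a new line
path = "C:\\data\\charts.txt"   # \\ produces a single backslash
quote = 'It\'s a chart'         # \' escapes a single quote inside single quotes

print(tabbed)
print(two_lines)
print(path)
print(quote)
```

Try removing one of the backslashes from `path` and see how the printed text changes.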

<h4>Placeholder fields</h4>

By using the curly brackets {} we can substitute values inside a string, see the examples here:

In [None]:
# Notice how .format is inside the print function
first_name = "John"
last_name = "Smith"
print("Version 0. Hello " + first_name + " " + last_name + ". Welcome to the website")



print("Version 1. Hello {0} {1}. Welcome to the website".format("John","Smith"))

print("Version 2. Hello {} {}. Welcome to the website".format("John","Smith"))
print("Version 3. Hello {var_0} {var_1}. Welcome to the website".format(var_0="John", var_1="Smith"))
print("Version 4. Hello {} {}. Welcome to the website".format(first_name,last_name))

print(f"Version 5 - F strings. Hello {first_name} {last_name}. Welcome to the website")

# We do not always need to have it as {0} the value inside the bracket can be empty such as {}.
# Then it will follow the normal method




<h4>Slicing strings</h4>
Strings are sequences of characters. We can access an individual character or a group of characters by 'slicing' the string

In [None]:
string = "The plot shows spikes over the last 12 months"
print(string[5])

<p>Why is the result "l" and not "p"? There are two reasons. White spaces are also characters, and indexing in Python is zero-based: the "T" in "The" is at position 0, not position 1. Counting from 0, the space is at position 3, "p" at position 4 and "l" at position 5. </p>
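
The cell above picks a single character. To take several characters at once, a slice of the form `[start:stop]` can be used. The following is a minimal sketch using the same sentence; the variable names are chosen only for this example.

```python
text = "The plot shows spikes over the last 12 months"

first_word = text[0:3]    # characters at positions 0, 1 and 2
second_word = text[4:8]   # positions 4 to 7; the stop position is excluded
last_word = text[-6:]     # negative positions count from the end

print(first_word)
print(second_word)
print(last_word)
```

Note that the stop position is never included, which is why `text[0:3]` returns three characters.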

**Quick string challenges**

Take the string saved in the variable "str_challenge" and print the 4th and 9th letters (counting the first character as position 1). Take those letters and fill in the blanks in this sentence "The value for position 4 is ? and the value for position 9 is ?"

You can 'unhide' the solution in the cells below but try to do it yourself first by editing the 'String challenge' code cell

In [None]:
#String challenge
str_challenge ="This is a challenge for you to see how well you understand replacement fields/concats and string indexing"

# type your code down here



In [None]:
#@title
str_4 = str_challenge[4-1]
str_9 = str_challenge[9-1]
print("The value for position 4 is '{0}' and the value for position 9 is '{1}'".format(str_4,str_9))

<h4> Integers & Floats </h4>

An integer is a whole number that can be positive or negative. No decimal points!!

A float is a positive or negative number with a decimal point

In [None]:
int_x = 5

In [None]:
print(int_x)

In [None]:
type(int_x)

In [None]:
float_x = 5.1

In [None]:
print(float_x)

In [None]:
type(float_x)

You can convert a float to an int. The decimal part is discarded, not rounded

In [None]:
int(float_x)
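
To make the conversion behaviour concrete, here is a small sketch comparing `int()`, which drops the decimal part, with the built-in `round()`. The variable names are only for illustration.

```python
truncated_positive = int(5.9)    # drops the decimal part
truncated_negative = int(-5.9)   # truncation goes toward zero
rounded = round(5.9)             # rounds to the nearest integer
as_float = float(5)              # converting the other way adds a decimal part

print(truncated_positive, truncated_negative, rounded, as_float)
```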

In [None]:
# simply change the type of a variable by assigning a different value
var = "python"

In [None]:
type(var)

In [None]:
var = 123
print(var)

**Pause. >>>> Let the instructor know you are at this point. Wait to resume with the next session**


---





#**Section 2: Collections**

<ul>
    <li>Lists</li>
    <li>Tuples</li>
    <li>Sets</li>
    <li>Dictionaries</li>
</ul>



<h2>Lists</h2>

Lists are one of 4 built-in data types in Python used to store collections of data. They are used to store multiple items in a single variable.

In [None]:
colors = ["Red","Blue","Green"]
colors[1]

Above is a Python list, which is similar to an array in other programming languages. Although lists look like arrays, they behave differently: a list can grow or shrink and can hold elements of different types.

Notice that collections in Python also use **zero-based** indexing: index 1 refers to the second element in the list above.

We can 'extend' a list by concatenating another list to it, see below:

In [None]:
#Extending a list
colors_two = ["Yellow","Purple","Orange"] # create a new list
colors.extend(colors_two) # concatenate with the first one, created above.
colors

In [None]:
colors_three = colors.copy()
colors_three.extend('green') # see what happens if we try to add an element to the list

In [None]:
colors_three # what is this ???

What happens here? The method "extend" concatenates an iterable (usually another list) to a list. If you pass a string like 'green', extend treats the string as a sequence of 5 elements, [g, r, e, e, n], and adds each character to the list.


In [None]:
colors.extend(['green']) # use square brackets to set green as a one-element list.
print(colors)

To add one element using extend, pass a list of one element, as in the cell above.

Let's look at more list methods. We can add and remove elements from a list

In [None]:
#add an element to a list
colors.append("Black")

In [None]:
print(colors)

In [None]:
#remove an element from a list
colors.pop()

In [None]:
colors # pop removes the last element from the list. Run the pop cell again and you will see 'green' removed next.

Pop removes the last element by default. However, you can pass an index to pop to remove the element at a specific position

In [None]:
colors.pop(1)

In [None]:
print(colors)
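
Since pop works with an index, it may help to compare it with the list method `remove`, which works with a value. The sketch below uses a made-up list called `shapes` so it does not change the `colors` list.

```python
shapes = ["circle", "square", "triangle", "square"]  # hypothetical example list

removed = shapes.pop(0)      # removes the element at index 0 and returns it
print(removed)

shapes.remove("square")      # removes the first element equal to "square"
print(shapes)
```

Only the first matching "square" is removed; the second one stays in the list.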

<h2>Tuples</h2>

Tuples are similar to lists except they are immutable, meaning that we cannot change, add or remove items after the tuple has been created.
Tuples are also ordered, which means that the items have a defined order, and that order will not change.
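
A minimal sketch of what "immutable" means in practice: trying to change an item raises a TypeError, while combining tuples creates a new tuple. The tuple names below are only for illustration.

```python
primary = ("Red", "Green", "Blue")

try:
    primary[0] = "Cyan"            # item assignment is not allowed on a tuple
except TypeError as error:
    print("Cannot modify a tuple:", error)

# Concatenation builds a new tuple; note the trailing comma for a one-element tuple
extended = primary + ("Yellow",)
print(extended)
```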

In [None]:
# there are two ways to create tuples, with/without parentheses
positive = "Red","Green","Blue"
negative = ("Cyan","Magenta","Yellow")

In [None]:
type(positive)

In [None]:
print(type(negative)) ## show the type of the collection

negative_one = ("Cyan","Magenta","Yellow")
negative_two =("Magenta","Yellow","Cyan") # create a new tuple, same element, different order
print("Are negative_one and negative_two the same?:", negative_one == negative_two)

When we create a tuple, we normally assign values to it. This is called "packing" a tuple.

To extract the values back into variables, we can "unpack" a tuple

Below is an example of unpacking a tuple

In [None]:
user_data = ("John","Smith","Male","20/02/1981")
first_name,last_name,gender,dob = user_data

In [None]:
print("My name is {0} {1} and I am a {2} and born on {3}".format(first_name,last_name,gender,dob))

<h2>Sets</h2>
Sets are collections with the following properties:

- Unordered, with **no duplicated elements**.

- Elements cannot be changed in place, but they can be added or removed

- Duplicated elements are ignored when the set is created


In [None]:
set_one = {'red','blue','green'}
print(type(set_one)) # print is needed here because this is not the last line of the cell
print(set_one)

set_two = {'red','blue','green', 'red'} ## duplicated elements are ignored
print(set_two)

print("Set one and set two are the same?:",set_one == set_two)

set_three = {'red','green','blue'}
print("Set one and set three are the same?:",set_one == set_three)


The code below shows a common use case for sets: removing duplicated elements from a list. Note that a set does not keep the original order of the elements

In [None]:
# Sets are sometimes used to remove duplicates from lists
test_list = [1, 5, 3, 6, 3, 5, 6, 1]
print ("The original list is : " +  str(test_list))

# convert list to SET, then back to list
test_list = list(set(test_list))
print ("The list after removing duplicates : " + str(test_list))
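
Converting to a set does not guarantee the original order of the elements. If the order matters, one possible alternative is `dict.fromkeys`, which keeps the first occurrence of each value in insertion order. This is a sketch of that alternative, not part of the original exercise.

```python
test_list = [1, 5, 3, 6, 3, 5, 6, 1]

# dictionary keys are unique and keep insertion order (Python 3.7+)
unique_in_order = list(dict.fromkeys(test_list))
print(unique_in_order)
```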

<h2>Dictionaries</h2>

A dictionary is an implementation of a data structure that is more generally known as an associative array. A dictionary consists of a collection of key-value pairs, where each key is unique and maps to its associated value. Since Python 3.7, dictionaries keep the order in which the keys were inserted.

Take a look at this example:


In [None]:
movies = {
    'name': 'Licorice Pizza',
    'year':2021,
    'genre':'Comedy',
    'language':'English',
    'imdb':7.3
}

In [None]:
print(movies)

In [None]:
type(movies) # check the data type of this variable

In [None]:
#Iterating through dictionaries -  one way to do it:
for key,values in movies.items():
  print("The key is {0} and the value is {1}".format(key,values))

In [None]:
#Iterating through dictionaries - Another way to do it:
for key in movies:
    print("The key is {0} and the value is {1}".format(key,movies[key]))

In [None]:
print(movies['name'])

<h4>Deleting an element from a dictionary</h4>
<p>You can delete a key-value pair from a dictionary with the del statement</p>

In [None]:
del movies["imdb"]
print(movies)

<h4>Adding an element to a dictionary</h4>
You can add an element to a dictionary

In [None]:
movies["imdb"] = 8.6
print(movies)

<h4>Combining dictionaries</h4>
<p>You can combine dictionaries using the update() method, which adds the key-value pairs of one dictionary to another</p>

In [None]:
cast_info = {'cast':
{
 "Jack Holden":"Sean Penn",
 "Jon Peters":"Bradley Cooper",
 "Rex Blau": "Tom Waits"
}
}

In [None]:
# this is a "cast_info" dictionary with 1 element, key = cast and value = another dictionary with 3 elements: key=character, value = actor/actress
cast_info

In [None]:
movies.update(cast_info)

In [None]:
movies
# within the key: cast, there is "nested" dictionary with the key as the character name, and the value as the actor name.

In [None]:
# another way to list dictionary elements, use the items() method
movies.items()

In [None]:
# a common way to iterate over collection elements, using the FOR loop (next section)
for key_name,value_name in movies.items():
    print(key_name,value_name)

We saw above how we can update a dictionary with the content of another one. But what happens when the other dictionary contains a key that already exists? In that case update() overwrites the existing value with the new one.
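
To see how update() handles a key that already exists, here is a small sketch. The dictionaries `film` and `new_info` are made up for this example so the `movies` dictionary is left untouched.

```python
film = {"name": "Licorice Pizza", "imdb": 7.3}      # hypothetical example dictionary
new_info = {"imdb": 8.6, "language": "English"}     # "imdb" already exists in film

film.update(new_info)   # existing keys are overwritten, new keys are added
print(film)
```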

<h4>Manipulating dictionaries</h4>

In [None]:
movies_keys = movies.keys()

In [None]:
print(movies_keys)

In [None]:
movies.values()

When we use the .items() method, Python returns a view object containing the dictionary's (key, value) pairs as tuples. Passing the view to list() converts it into a list of tuples.

In [None]:
#Making a dict into a list of tuples using the .items() method
movie_tuple = list(movies.items())
print(movie_tuple)

**Dictionary Challenge**

Let's practice working with dictionary structures

In [None]:
#Using the following dictionary
horror_films = {

    "Directors":["Wes Craven","Alfred Hitchcock","George Romero"],
    "Film":["Scream","The Birds","Night of the Living Dead"]
}

#1.) Add a new key to the dictionary called "Lead Actress" with the following values "Neve Campbell",
#   "Tippi Hedren","Judith O'Dea". Think about which data type we have seen before can hold these values
#2.) Add the following values for each Key
#    Directors: "Stanley Kubrick"
#    Film: "The Shining"
#    Lead Actress: "Shelley Duvall"

Solution below

In [None]:
#@title
horror_films["Lead Actress"] = ["Neve Campbell","Tippi Hedren","Judith O'Dea"] # Solution 1
horror_films["Directors"].append("Stanley Kubrick")
horror_films["Film"].append("The Shining")
horror_films["Lead Actress"].append("Shelley Duvall")
print(horror_films)

In [None]:
for key_name,value_name in horror_films.items():
    print(key_name,value_name)

**Pause. >>>> Let the instructor know you are at this point. Wait to resume with the next session**


---





#**Section 3: Control Flow**

A program’s control flow is the order in which the program’s code executes.

The control flow of a Python program is regulated by conditional statements, loops, and function calls. It lets us test a condition and then execute a block of code.

Python has three types of control structures:


*   Sequential - default mode.
*   Selection - used for decisions and branching.
*   Repetition - used for looping, i.e., repeating a piece of code multiple times.

Important considerations for the code below:
*   Indentation (to group lines that belong to the "block")
*   The colon symbol (:) to indicate when a block of code starts


In [None]:
letter = 'A'
if letter == 'A':
  print("the letter is A")
print("This line prints always")

In [None]:
# This simple example applies a 5% discount for totals of 250 or more. Check with values 150 and 300

total = 300
if total >= 250:
    total *= 0.95
    print("A 5% discount is applied")
print("Total after discount = £{:.2f}".format(total))

# indentation is very important here !!

In [None]:
## IF/ELSE statement is added
total = 150
if total >= 250:
    total *= 0.95
    print("Total after discount = £{:.2f}".format(total))
else:
    print("Add £{:.2f} to receive a 5% discount".format(250 - total))

In [None]:
## IF/ELIF/ELSE statement is added
total = 250
if total > 300:
    print("More than 300")
elif total < 300 and type(total) == int:
    print("Value is less than 300 and is type integer")
else:
    print("Value is exactly 300 or is not type integer")

<h2> Looping </h2>
A loop in programming is used to execute the same block of code a number of times. This can be done in various ways.
<h3> 'For' Loops </h3>
For loops work by iterating through a sequence of values. With range(start, stop) we can generate the values from start up to, but not including, stop. In the example below, the range goes from 0 to 9:

In [None]:
for n in range(0,10):
    print("n is now {0}".format(n))
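
To check exactly which values `range` produces, it can be converted to a list. The sketch below also shows the optional third argument, the step. The variable names are chosen only for this example.

```python
zero_to_nine = list(range(0, 10))   # the stop value 10 is not included
short_form = list(range(5))         # the start defaults to 0
even_numbers = list(range(2, 11, 2))  # the third argument is the step

print(zero_to_nine)
print(short_form)
print(even_numbers)
```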

We could just as easily iterate through a list of values as shown in this example:

In [None]:
for n in [3, 9, 19]:
    print("n is now {0}".format(n))

<h3>'While' Loops</h3>
<p>While loops repeat a block of code as long as the condition is true. In this case the loop stops when i reaches 10</p>

In [None]:
i = 0
while i < 10:
    print("i is now {0}".format(i))
    i+=1

<h2>Guess a number challenge</h2>

In this challenge you should write code that generates a random number between 1 and 100 and allows the user to input numbers to guess that number.
After making each guess the user should be told if they are too high or too low and then given another chance to guess again, up to 5 attempts.

Once the user gets the number correct they should be congratulated and told how many attempts they took to get it correct.



In [None]:
# try here to write the code by yourself


In [None]:

# Step 1: import random, a Python module that helps us generate a random number
import random

secret = random.randint(1, 100)  # random integer from 1 to 100, both included
# print(secret)  # uncomment to reveal the secret number while testing
guess_count = 1

guess = int(input("Guess the secret number: "))

# keep asking while the guess is wrong and fewer than 5 attempts have been used
while (guess != secret) and (guess_count < 5):
  guess_count += 1
  if guess > secret:
    guess = int(input("Guess is too high, please guess again: "))
  else:
    guess = int(input("Guess is too low, please guess again: "))

if guess == secret:
  print("Correct! Number of guesses was {0}".format(guess_count))
else:
  print("Out of guesses!")

Additional challenge, optional: the solution above allows 5 attempts. Change it so that the maximum number of attempts is stored in a variable and can be adjusted easily.

**Pause. >>>> Let the instructor know you are at this point. Wait to resume with the next session**


---





#**Section 4: Functions**

Functions allow programmers to group a series of commands together to make a procedure that can be run as many times as required, in other words: 'reused'.

In Python the keyword "def" is used to define a function, followed by the name of the function, brackets and a colon

In [None]:
def print_name(name):
    print("Hello {0}, how are you today?".format(name))


The way to define a function is by first using the "def" keyword. A function requires a name. This way we can call that function by referring to its name

A function requires the following:

- A name

- Lines of code that will execute the logic of the function

- Brackets ()

- Inside the brackets, zero or more parameters (while input parameters are optional, brackets are always required)


In [None]:
print_name("Michael")

In [None]:
#Let's make a simple function that prints out the first letter of a string
def first_letter(value):
    print(value[0])

In [None]:
first_letter("Hello")

<h2>Passing parameters or arguments to a function</h2>

Parameters are like variables that can be used inside the function. They are not mandatory but are very useful.

Python has lots of built-in functions for many purposes, like len(), sorted(), input() etc. A list of built-in functions can be found at this link: https://docs.python.org/3/library/functions.html

Let's look now at the return statement

In [None]:
def average(a,b):
    x = (a + b) / 2
    return(x)

In [None]:
average(2,3)

Just as it is possible to send values to a function, a function can send a value back to where it was called from. This is done with the return statement, followed by the value to return. The brackets used in the example are optional.

In the previous example the function calculates the average of two values and returns the result. Run the code below to see what happens when we use or we don't use the return statement in a function.

In [None]:
def average_noreturn(a,b):
    x = (a + b) / 2

In [None]:
return_var = average(2,3)
noreturn_var = average_noreturn(2,3)

In [None]:
print(return_var)

In [None]:
print(noreturn_var)

You can see that if no return value is specified then the function returns None, a special value with the type NoneType

In [None]:
type(noreturn_var)

<h4>Variable Scope</h4>
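Variables declared inside a function can only be accessed inside that function, whereas global variables in the main block of code can be read by any function. Assigning to a name inside a function creates a new local variable, even if a global variable with the same name exists.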

In [None]:
m_var = 10 # Declared outside the function global access
answer = 0
def sum_func(val1,val2):
    answer = val1 + val2 * m_var # answer is declared inside the function
    return(answer)

print(sum_func(2,1))
print(answer)
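
In the cell above, the global `answer` stays at 0 because the function works on its own local `answer`. If a function really needs to change a global variable, the `global` keyword can be used. The sketch below uses a made-up variable called `counter` to show the difference.

```python
counter = 0  # global variable

def set_local():
    counter = 100      # creates a new local variable; the global is untouched
    return counter

def increment_global():
    global counter     # tells Python to use the global variable
    counter += 1
    return counter

set_local()
print(counter)         # the global value has not changed
increment_global()
print(counter)         # the global value has been incremented
```

In most cases returning a value, as `sum_func` does, is easier to follow than modifying a global variable.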

<h3> Referencing </h3>
Arguments are passed to a function as references to objects. If the function modifies a mutable object such as a list in place, the change is visible outside the function. Reassigning the parameter name inside the function does not affect the caller's variable.

In [None]:
# Another example
original_list = [1,2,3]

def update_list(passed_list):
    passed_list.append(4)
    #print(passed_list)
print(original_list)

In [None]:
update_list(original_list)

In [None]:
original_list
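
The difference between changing an object in place and reassigning a name can be seen in the following sketch. The function and variable names are made up for this illustration.

```python
def append_item(items):
    items.append(4)        # changes the list object that the caller passed in

def rebind(items):
    items = [0, 0, 0]      # reassigns the local name only; the caller's list is unchanged

def add_one(n):
    n += 1                 # integers are immutable, so this creates a new local object
    return n

numbers = [1, 2, 3]
append_item(numbers)
rebind(numbers)
print(numbers)

value = 10
add_one(value)
print(value)
```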

<h4>Function coding challenge: Build a function that can search through a list</h4>

<ol>
<li>Use this list: sample_list = ["Francisco", "Arianna", "Daniel", "James"]</li>

<li>Write a function called "find_word" that takes two parameters: (1) the word you are searching for and (2) the list you are going to search through</li>

<li>The function should return the following sentence: "The word {0} is element {1} of the list" and replace the {0} with the word and the {1} with the number of the element on the list</li>

<li>Hint: use the list method .index(), for example: listname.index(word)</li>

</ol>

In [None]:
# your code here



In [None]:
#@title
def find_word(word, input_list):
  if word in input_list:
    pos = input_list.index(word)  # zero-based index of the first match
    return "The word {0} is element {1} of the list".format(word, pos + 1)
  return "The word {0} is not in the list".format(word)

In [None]:
sample_list = ["Francisco", "Arianna", "Daniel", "James"]
find_word("Daniel", sample_list)

In [None]:
#Type your code in here or use 'Show code' to display a potential solution in the cell below

In [None]:
search = find_word
search("Arianna", sample_list)

**Pause. >>>> Let the instructor know you are at this point.**

<h1>End of Module 1: Python and Jupyter Notebooks</h1>



---





#**Module 2 - Python Data Libraries**


**Module 2** covers data libraries and packages available in Python: how to install them, run data functions and evaluate results. This part introduces some of the most popular and widely used libraries in data science and analytics:

<ul>
    <li>Pandas</li>
    <li>Numpy</li>
    <li>Matplotlib</li>
    <li>Seaborn</li>
</ul>


---
#**Section 1: Preparing Data**

Often you will want to save data that you have processed in some way, either for storage or transmission. There are various modules that you can import to achieve this:
*   Pickle: serialises Python objects. The generated file is not human readable and only Python can understand its content.
*   CSV: good for storing and transmitting small amounts of tabular information. A very popular format supported by almost every language.
*   JSON: supports more data structures than CSV and is capable of dealing with large datasets. Human readable and also very popular; almost any language can decode JSON files.

The pickle and json modules use a similar syntax to encode and decode data (the csv module uses reader and writer objects instead):
*   The relevant module is imported
*   Dump is used to encode data
*   Load is used to decode a file


In [None]:
import json

# assign a dictionary to the variable data
data = {
    "Movie": {
        "title": "The Irishman",
        "year": "2019"
    }
}

print("Type of the variable data:", type(data))

# convert the variable to a JSON string
json_string = json.dumps(data, indent=0)

print(json_string) # print value of the json string

print("Type of the variable json_string:", type(json_string))

# store the dict into a json file maintaining format. Check the data_file.json
# file in your colab home directory. This process is called serialization
with open("data_file.json", "w") as write_file:
    json.dump(data, write_file, indent=4)

# de-serializing the data, reading from a json file to a variable in memory
with open("data_file.json", "r") as read_file:
    data_file = json.load(read_file)

print("show json data after load from file:")
print(data_file)
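
The JSON cell above covers one of the three formats. As a rough sketch of the other two, the code below round-trips a small made-up table with pickle (using dumps/loads, which mirror the JSON functions) and with csv (using writer and reader objects). It works in memory with `io.StringIO` instead of a file, which is a choice made here only to keep the example self-contained.

```python
import csv
import io
import pickle

rows = [["title", "year"], ["The Irishman", "2019"]]  # hypothetical sample data

# pickle: dumps/loads work like json.dumps/json.loads but produce bytes
blob = pickle.dumps(rows)
restored = pickle.loads(blob)

# csv: a writer object encodes rows, a reader object decodes them
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
buffer.seek(0)                      # go back to the start before reading
read_back = list(csv.reader(buffer))

print(restored == rows, read_back == rows)
```

Note that csv stores every value as text, which is why the year is kept as a string in this example.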


#**Section 2: Reading & Writing data**

<h4>Reading Data</h4>
<p>

The Pandas module has a number of functions for loading data from an external document directly into an object called a DataFrame. </p>
<p> The DataFrame in Pandas is a very useful object for representing data, like a matrix, spreadsheet, or a database table.</p>

In [None]:
import pandas as pd

<h4>Load DataFrame from CSV files </h4>

The Pandas function 'read_csv' loads a CSV document into a DataFrame; its only required argument is the file path or URL. In this workshop the file is loaded directly from a URL.

This sample dataset contains statistics about the Football English Premier League season 2021 - 2022

In [None]:
#Read a CSV file into a dataframe
dframe = pd.read_csv("https://storage.googleapis.com/python-resources-387528315161/PL_top_scorers_2021_2022.csv")
# we will use this dframe dataframe for most of the steps below

In [None]:
type(dframe)

In [None]:
#the .head() function will give us the first 5 rows of a dataframe,
# .tail() the bottom lines and .sample() takes rows at random

#dframe.head(15)
#dframe.tail()
dframe.sample(5)

<h4>Load DataFrame from Excel files</h4>

Below is another example of reading an external file into a pandas DataFrame. In this case we are working with an MS Excel document (.xlsx). Once the pandas function loads the data into the DataFrame, the manipulation of the content is exactly the same, regardless of the source of the data.

The read_excel function has a number of additional arguments (preferences) that can be passed when calling the function and a full list of the arguments can be found in the documentation:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html

In [None]:
#Read an Excel file into a dataframe
df_excels = pd.read_excel("https://storage.googleapis.com/python-resources-387528315161/XLSX_PL_Top%20Scorers.xlsx")

In [None]:
df_excels.head()



---


#**Section 3: Transforming and Indexing**

<p>It is possible to create a data frame from a number of data types, as long as there is enough data to be able to represent it in a tabular format.</p>

<h4>Load Data Frame from Dictionary</h4>

<p>A dictionary is the closest in structure to a data frame, so it is the most common input for creating one. The example below shows the creation of a dictionary and its conversion to a data frame by calling a Pandas function
</p>

In [None]:
dframe_dict = pd.DataFrame({ # converting a dictionary to a 'df' Pandas dataframe

    'Name': ['Salah','Ronaldo','Kane'],
    'Goals': [23,18,17],
    'Assists':[13,3,9],
    'Team':["LIV","MANU","TOT"]
})
dframe_dict.head()

<h4>Load Data Frame from Tuple</h4>
A list of tuples can be used to create a data frame, however you will have to specify the column names separately otherwise numerical values incrementing from 0 will be used, as shown in the example below:

In [None]:
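#From a list of tuples: each tuple is (name, goals, assists, team)
player_tuples = [('Salah',31,9,'LIV'),
                 ('Kane',26,2,'TOT'),
                 ('Aguero',21,6,'MANC')
                ]

labels = ["Name","Goals","Assists","Team"]

dframe_list = pd.DataFrame(player_tuples) # without column names, the columns are numbered 0, 1, 2, 3
dframe_list = pd.DataFrame(player_tuples,columns=labels) # with column names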

In [None]:
dframe_list.head()

<h4>Load Data Frame from Pandas series</h4>

A pandas 'Series' is a one-dimensional labelled array. In this example each Series represents one row of data.
To create a DataFrame from a list of Series, each Series needs to have its index defined individually, as shown below:

In [None]:
#From a series
player_series = [

    pd.Series(["Salah",31,9,"LIV"],index=["Name","Goals","Assists","Team"]),
    pd.Series(["Kane",26,2,"TOT"],index=["Name","Goals","Assists","Team"]),
    pd.Series(["Aguero",21,6,"MANC"],index=["Name","Goals","Assists","Team"])
]
dframe_series = pd.DataFrame(player_series)

In [None]:
dframe_series.head(2)

**Pause. >>>> Let the instructor know you are at this point. Wait to resume with the next session**


---



#**Section 4: Working with Pandas DataFrames**


<p>Once you have created a data frame, there are a number of methods for selecting specific sections and then manipulating that data in some way.

For the samples below, we use the DataFrame **dframe** created in previous step.
</p>

**Selecting DataFrame Columns**

<p>It is possible to select a single column with square brackets, in the same way that you would look up a value in a dictionary by its key, as shown below</p>

In [None]:
dframe["Player"]

In [None]:
#Accessing multiple columns
pga = pd.DataFrame(dframe,columns = ["Player","Goals","Assists"])

In [None]:
pga.head()

In [None]:
#Another way to access multiple columns
dframe[["Player","Goals","Assists"]]

<p>Accessing specific data based on logical conditions</p>

In [None]:
dframe[(dframe["Assists"] > 5) & (dframe["Goals"] < 10)]

The iloc indexer retrieves rows from a data frame using their integer positions

In [None]:
dframe.iloc[0:4]
# the first value within the square brackets is the position of the first row to retrieve; the second value is the position where the slice stops (not included), so 0:4 returns rows 0 to 3

By default an index is created for a DataFrame. But you can set a specific column of the DataFrame as index, if required, using the set_index() function

In [None]:
dframe_loc = dframe.set_index('Player')

In [None]:
dframe_loc.head()

The loc property is used to access a group of rows and columns. It is primarily label based, but may also be used with a boolean array.

In [None]:
# This command fails, wrong key
dframe_loc.loc["Mohammed Salah"]

# comment line above, remove comment below and run with the new key

#dframe_loc.loc["Mohamed Salah"]

In [None]:
dframe_loc.loc[["Mohamed Salah","Harry Kane"],["Team","Assists"]]
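
To compare the two indexers side by side, here is a minimal sketch. It uses a small made-up table called `scorers` (not the workshop dataset) so that the block runs on its own.

```python
import pandas as pd

# hypothetical mini table, not the workshop dataset
scorers = pd.DataFrame(
    {"Goals": [23, 17, 15], "Assists": [13, 9, 3]},
    index=["Salah", "Kane", "Mane"],
)

by_position = scorers.iloc[0:2]          # rows at positions 0 and 1; the stop is excluded
by_label = scorers.loc["Salah":"Kane"]   # label slices include both ends
single = scorers.loc["Kane", "Goals"]    # one cell: row label, then column label

print(by_position)
print(by_label)
print(single)
```

Both slices return the same two rows here, but for different reasons: iloc stops before position 2, while loc includes the label "Kane".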

<p>Often you will want to return just one or a few columns, in which case 'loc' is very useful. It allows you to pass a logical condition and the columns to be returned. The example demonstrates how you could find the average total shots of players who have scored more than 10 goals:</p>

In [None]:
dframe.loc[dframe["Goals"] > 10, "Total Shots"].mean()

In the example above the mean was found. A snapshot of general statistical information can be obtained using the **describe** function, as shown below:

In [None]:
dframe.loc[dframe["Goals"] > 10, "Total Shots"].describe()

In [None]:
# .copy() creates an independent DataFrame, so the assignment below does not
# trigger a SettingWithCopyWarning and never touches dframe
CF_players = dframe[dframe["Goals"] > 20].copy()
CF_players['Pos'] = "RB"
CF_players


**Adding columns**

You can add a new column and the data to be assigned to it using the following code:

In [None]:
dframe["Pos"] = "CF"

In [None]:
dframe.head()

**Removing Columns or Rows**

Using the Pandas function 'drop' you can remove a column or a row from a data frame. The following example shows how to remove the 'Pos' column from the dataframe:

*axis: int or string value, 0 for Rows and 1 for Columns*

In [None]:
dframe.drop(["Pos"],axis=1,inplace = True)

In [None]:
dframe.head()

In [None]:
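# how to delete a row from the data frame by index label
dframe.drop(4,axis=0,inplace = True) # deletes the row labelled 4, the fifth row (indexing starts at zero)
# how to delete rows by a value in a column
# (without inplace=True, drop returns a new data frame and leaves dframe unchanged)
dframe.drop(dframe.index[dframe['Player'] == 'Harry Kane'])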
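The effect of leaving out `inplace=True` can be checked on a small made-up frame. This is a sketch with hypothetical player labels; it also shows boolean filtering as an equivalent way to remove rows by value.

```python
import pandas as pd

demo = pd.DataFrame({"Player": ["P1", "P2", "P3"], "Goals": [23, 17, 15]})

# Without inplace=True, drop() returns a new DataFrame and leaves demo unchanged
without_p2 = demo.drop(demo.index[demo["Player"] == "P2"])
print(len(demo), len(without_p2))

# Boolean filtering is an equivalent way to remove rows by value
also_without_p2 = demo[demo["Player"] != "P2"]
print(also_without_p2.equals(without_p2))
```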


*Challenge*

Assign a position (a "Pos" column) to each player in the dframe data frame based on the number of goals scored, as follows:

*   more than 20 goals, assign "CF" (center forward)
*   between 12 and 20 goals (inclusive), assign "MF" (midfielder)
*   less than 12 goals, assign "D" (defender)

In [None]:
pos_list = []
for goal in dframe["Goals"]:
  # your code here
  pass  # placeholder so the cell runs; replace it with your solution
dframe.head(20)

In [None]:
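# One possible approach: assign the position row by row, using each row's index label
# (assigning dframe["Pos"] = "CF" inside the loop would overwrite the whole column)
for idx, goal in dframe["Goals"].items():
  if goal > 20:
    dframe.loc[idx, "Pos"] = "CF"
  elif goal >= 12:
    dframe.loc[idx, "Pos"] = "MF"
  else:
    dframe.loc[idx, "Pos"] = "D"
dframe.head(5)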


In [None]:
#@title
pos_list = []
for goal in dframe["Goals"]:
  if goal > 20:
    pos_list.append("CF")
  elif goal >= 12:  # 12 to 20 goals; values above 20 were handled by the branch above
    pos_list.append("MF")
  else:
    pos_list.append("D")
print(pos_list)
dframe["Pos"] = pos_list
dframe.head()

In [None]:
dframe.head(20)
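The loop-based solution above can also be written without an explicit loop. The sketch below uses NumPy's `select` function on a made-up "Goals" column; it is one alternative among several, not the workshop's reference solution.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"Goals": [23, 20, 12, 11, 5]})  # hypothetical values

# Conditions are checked in order; the first one that is True wins
conditions = [demo["Goals"] > 20, demo["Goals"] >= 12]
choices = ["CF", "MF"]
demo["Pos"] = np.select(conditions, choices, default="D")
print(demo)
```

Vectorized expressions like this tend to be shorter and faster than row-by-row loops on large data frames.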

**Dropping Columns**

The drop() method removes the specified row or column.

By specifying the column axis (axis='columns'), the drop() method removes the specified column.

By specifying the row axis (axis='index'), the drop() method removes the specified row.

In [None]:
dframe.drop(["Pos"],axis=1)

**Merge**

The Pandas function 'merge' allows two DataFrames to be merged into one, a bit like a 'join' in SQL. To demonstrate the uses of this function the following DataFrames have been created:

In [None]:
dframe1 = pd.DataFrame({

    'key1': ['A','B','B','C','C'],
    'key2': ['Y','Y','Z','Z','Z'],
    'series1':range(5)
})

dframe2 = pd.DataFrame({

    'key1': ['B','C','D','D'],
    'key2': ['X','Z','Y','Z'],
    'series2':range(4)
})

In [None]:
dframe1.head()


In [None]:
dframe2.head()

By default, the function merges on every column that has the same name in both DataFrames. So if 'dframe1' and 'dframe2' are merged, the columns 'key1' and 'key2' are used, and the resulting DataFrame only includes rows where key1 and key2 have exactly the same values in both DataFrames, as shown below:

In [None]:
pd.merge(dframe1,dframe2)

You can specify which column to merge on by passing the 'on' argument. The example below uses only 'key1', so the output contains one row for every pair of rows from dframe1 and dframe2 that share the same key1 value. Because 'key2' exists in both DataFrames but is not a merge key, it appears twice, as key2_x and key2_y:

In [None]:
#Merging on a specific key
pd.merge(dframe1,dframe2,on='key1')

It is also possible to specify what sort of join should be used in the merge by passing either 'inner', 'left','right' or 'outer' in the 'how' argument, as shown below:

In [None]:
pd.merge(dframe1,dframe2,how='left')
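To compare the four join types side by side, the sketch below rebuilds the two frames from above under the assumed names `left` and `right` and prints the number of rows each join returns. The `indicator=True` argument adds a `_merge` column that records whether a row came from the left frame, the right frame, or both.

```python
import pandas as pd

left = pd.DataFrame({
    "key1": ["A", "B", "B", "C", "C"],
    "key2": ["Y", "Y", "Z", "Z", "Z"],
    "series1": range(5),
})
right = pd.DataFrame({
    "key1": ["B", "C", "D", "D"],
    "key2": ["X", "Z", "Y", "Z"],
    "series2": range(4),
})

# Row counts for each join type, merging on key1 and key2
row_counts = {}
for how in ["inner", "left", "right", "outer"]:
    merged = pd.merge(left, right, how=how)
    row_counts[how] = len(merged)
print(row_counts)

# indicator=True adds a _merge column telling where each row came from
outer = pd.merge(left, right, how="outer", indicator=True)
print(outer)
```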

**Combine**

The Pandas function 'combine_first' merges two DataFrames into one by filling the missing values of the first with values from the second. To demonstrate the use of this function, the following DataFrames have been created.

First, the 'numpy' module is loaded. NumPy is a Python library used for working with arrays; here it provides the NaN (missing) value.


In [None]:
import numpy as np

In [None]:

nan = np.nan

In [None]:
dframe1 = pd.DataFrame({

    'A':[10,11,nan,12,13],
    'B':[14,nan,nan,15,16]

})

dframe2 = pd.DataFrame({

    'A':[21,nan,22,nan],
    'B':[23,24,nan,nan],
    'C':[nan,25,26,27]
})

In [None]:
dframe1

In [None]:
dframe2

The combine_first function outputs a new data frame based on the values in dframe1. Where dframe1 has a null value, the value at the same position in dframe2 is used instead.

Notice that the output contains the union of the row labels and column names of the two inputs, so along each dimension it is as large as the larger input data frame.

In [None]:
dframe1.combine_first(dframe2)
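The sketch below repeats the same data under the assumed names `first` and `second`, then inspects a few cells to make the fill rule concrete. A cell stays NaN only when neither frame has a value for it.

```python
import numpy as np
import pandas as pd

first = pd.DataFrame({
    "A": [10, 11, np.nan, 12, 13],
    "B": [14, np.nan, np.nan, 15, 16],
})
second = pd.DataFrame({
    "A": [21, np.nan, 22, np.nan],
    "B": [23, 24, np.nan, np.nan],
    "C": [np.nan, 25, 26, 27],
})

combined = first.combine_first(second)
print(combined.shape)          # rows and columns of the result
print(combined.loc[2, "A"])    # missing in first, taken from second
print(combined.loc[2, "B"])    # missing in both, stays NaN
print(combined)
```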

**Concatenate**

The Pandas function 'concat' joins multiple DataFrames along an axis: by default it stacks them on top of one another (axis=0), and with axis=1 it places them side by side. To demonstrate the use of this function, the following DataFrames have been created:

In [None]:
dframe1 = pd.DataFrame({
    'A': ['A1','A2','A3','A4','A5'],
    'B': ['B1','B2','B3','B4','B5']
})

dframe2 = pd.DataFrame({
    'A': ['A6','A7','A8','A9'],
    'B': ['B6','B7','B8','B9'],
    'C': ['C6','C7','C8','C9']
})

In [None]:
dframe1

In [None]:
dframe2

Notice how null values are used when the data frame dimensions don't match up.

In [None]:
frames = [dframe1,dframe2]
pd.concat(frames,ignore_index=True)

Let's concatenate along the other axis and compare the results:

In [None]:
pd.concat(frames, axis=1)
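A quick way to compare the two axes is to look at the shape of each result. The sketch below uses two small made-up frames under the assumed names `top` and `bottom`.

```python
import pandas as pd

top = pd.DataFrame({"A": ["A1", "A2"], "B": ["B1", "B2"]})
bottom = pd.DataFrame({"A": ["A3"], "B": ["B3"], "C": ["C3"]})

stacked = pd.concat([top, bottom], ignore_index=True)   # axis=0: rows are appended
side_by_side = pd.concat([top, bottom], axis=1)         # axis=1: columns are appended

print(stacked.shape)
print(side_by_side.shape)
print(list(side_by_side.columns))
```

With axis=1 the rows are aligned on the index, and column names are kept as they are, so duplicates such as two "A" columns can appear in the result.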

NaN or empty cells can be replaced with a value of your choice using the 'fillna' function

In [None]:
frames_df = pd.concat(frames,ignore_index=True)

In [None]:
frames_df.head(10)

In [None]:
frames_df.fillna("C2")
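fillna also accepts a dictionary, which lets each column get its own replacement value. The sketch below uses a made-up frame and hypothetical placeholder strings; like drop, fillna returns a new data frame unless told otherwise.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"A": ["A1", np.nan], "C": [np.nan, "C2"]})

# One replacement value per column
filled = demo.fillna({"A": "no A", "C": "no C"})
print(filled)

# The original frame still contains its NaN values
remaining = int(demo.isna().sum().sum())
print(remaining)
```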

**Pause. >>>> Let the instructor know you are at this point. Wait to resume with the next session**


---



#**Section 5: Data Visualization with Libraries**

<p>Python gives us several options for visualizations. These allow us to present the results of our data analysis and gain further insights.</p>
<p>Seaborn and Matplotlib are both powerful libraries for this purpose.</p>

In [None]:
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
help(plt) # check available functions for matplotlib.pyplot

<h3>Load a number of datasets for plotting</h3>

Iris: the Iris (flowers) dataset contains four features measured on 50 samples from each of three species of Iris. It is often used in data mining examples and to test algorithms.

In [None]:
ds_1 = np.random.randn(100)
ds_1

In [None]:
ds_1 = np.random.randn(100)
ds_2 = np.random.randn(100)

pl_df = pd.read_csv(folder_path + "PL_top_scorers_2021_2022.csv")
i_df = sns.load_dataset("iris")
f_df = sns.load_dataset("flights")

Let's sample the data frames

In [None]:
ds_1[0:14]

In [None]:
ds_2[0:14]

In [None]:
pl_df.head(15)

In [None]:
i_df.head(15)

In [None]:
f_df.head(15)

First plotting sample using the pyplot module.

Matplotlib provides the **barh()** function for horizontal bar charts. It can be called through the MATLAB-style pyplot interface, as below, or as a method of an Axes object in the object-oriented API.

In [None]:
plt.barh(pl_df["Player"].head(),pl_df["Goals"].head())

In [None]:
%matplotlib inline
plt.barh(pl_df["Player"].head(),pl_df["Goals"].head(),color='green',edgecolor='black',height=1)
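For comparison, here is a sketch of the same kind of chart written with the object-oriented API. The player labels and goal counts are hypothetical stand-ins for the workshop data.

```python
import matplotlib.pyplot as plt

# Hypothetical labels and values, for illustration only
players = ["P1", "P2", "P3"]
goals = [23, 17, 15]

# subplots() returns the figure and an Axes object to draw on
fig, ax = plt.subplots(figsize=(6, 3))
bars = ax.barh(players, goals, color="green", edgecolor="black")
ax.invert_yaxis()          # put the first player at the top
ax.set_xlabel("Goals")
ax.set_title("Goals per player")
```

Working with an explicit Axes object makes it easier to adjust labels and titles, or to place several plots in one figure.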

<h3>Histograms</h3>

Histograms are used to show the distribution of a variable. They plot quantitative data by grouping the range of values into bins (intervals) and counting how many values fall into each bin.

In [None]:
plt.hist(ds_1)
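To see the numbers behind a histogram, NumPy's `histogram` function returns the bin counts and bin edges without drawing anything. The sketch below assumes a seeded random generator so that it is repeatable; 10 bins matches the default used by plt.hist.

```python
import numpy as np

rng = np.random.default_rng(0)       # seeded so the sketch is repeatable
values = rng.standard_normal(100)

# counts has one entry per bin; edges has one more entry than counts
counts, edges = np.histogram(values, bins=10)
for count, low, high in zip(counts, edges[:-1], edges[1:]):
    print(f"{low:6.2f} to {high:6.2f}: {count}")
```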

In [None]:
plt.hist(ds_1,alpha=0.3,color='red') #Plots the histogram
sns.rugplot(ds_1) #Shows the blue ticks at the bottom

#seaborn's rugplot function complements other plots by showing the location of
#individual observations in an unobtrusive way.

<h3>Regression Plots</h3>

Regression plots in seaborn add a visual guide that helps to emphasize patterns in a dataset. They draw a regression line between two variables and help to visualize their linear relationship.

In [None]:
#sns.regplot('Goals','Assists',pl_df)
#sns.regplot(pl_df)
sns.regplot(data=pl_df, x='Goals', y='Total Shots')
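The plot shows the fitted line but not its numbers. As a rough sketch of what a straight-line fit produces, the example below uses NumPy's `polyfit` on made-up data that lies exactly on y = 2x + 1. This illustrates a least-squares line in general and is not a description of seaborn's internals.

```python
import numpy as np

# Hypothetical data lying exactly on y = 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2 * x + 1

# deg=1 fits a straight line; coefficients come back highest power first
slope, intercept = np.polyfit(x, y, deg=1)
print(round(slope, 3), round(intercept, 3))
```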


In [None]:
pl_df = pd.read_csv(folder_path + "PL_top_scorers_2021_2022.csv")

In order to have a better analysis capability using these plots, we can specify 'hue' to have a categorical separation in our plot as well as use markers that come from the matplotlib marker symbols.

In [None]:
#sns.lmplot(data=pl_df, x='Goals', y='Total Shots',hue='Pos',markers = ['x','o'])
sns.lmplot(data=pl_df, x='Goals', y='Total Shots',hue='Pos',markers = ['x','o','*'])

kdeplot draws a kernel density estimate, a smoothed estimate of the probability density function of a continuous variable. It can plot a single variable (univariate) or two variables together (bivariate).

In [None]:
sns.kdeplot(ds_1)

Seaborn's distplot() plots a univariate distribution of observations as a histogram with a density curve drawn over it. It combines the matplotlib hist function with the seaborn kdeplot() and rugplot() functions.

*the distplot() function is deprecated. Its features have been subsumed by displot() and histplot()*

In [None]:
sns.distplot(ds_1)

In [None]:
sns.displot(ds_1)

The pivot() function is used to 'reshape' a given DataFrame organized by given index / column values.

In [None]:
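# keyword arguments are required here: recent pandas versions no longer accept them positionally
flights = f_df.pivot(index='month', columns='year', values='passengers')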

In [None]:
flights.head()
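What pivot does is easier to follow on a very small table. The sketch below builds a hypothetical long-format frame (one row per month and year) and reshapes it so that months become rows and years become columns.

```python
import pandas as pd

# Hypothetical long-format table: one row per (month, year) pair
long_df = pd.DataFrame({
    "year": [2020, 2020, 2021, 2021],
    "month": ["Jan", "Feb", "Jan", "Feb"],
    "passengers": [100, 110, 120, 130],
})

# index -> row labels, columns -> column labels, values -> cell contents
wide = long_df.pivot(index="month", columns="year", values="passengers")
print(wide)
```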

Then we can use the heatmap function in Seaborn to create a heat map of this reshaped data frame. Heat maps are used for correlation and other types of analysis.

In [None]:
sns.heatmap(flights)

In [None]:
f, ax = plt.subplots(figsize=(9, 6))
#ax
sns.heatmap(flights, annot=True, fmt="d", linewidths=.5, ax=ax)

<h3>Correlation Analysis</h3>

The jointplot function in Seaborn plots two variables against each other, together with the distribution of each one. The kind parameter defines the type of visualization: "scatter", "kde", "hist", "hex", "reg" or "resid".

In [None]:
# x and y must be passed as keyword arguments in recent seaborn versions
sns.jointplot(x=ds_1, y=ds_2, kind='kde')
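A joint plot shows a relationship visually; the correlation coefficient summarizes it as one number between -1 and 1. The sketch below uses seeded random data under the assumed names `a`, `b` and `c` to contrast two unrelated variables with two strongly related ones.

```python
import numpy as np

rng = np.random.default_rng(1)              # seeded so the sketch is repeatable
a = rng.standard_normal(100)
b = rng.standard_normal(100)                # generated independently of a
c = a + 0.1 * rng.standard_normal(100)      # a plus a little noise

# corrcoef returns a 2x2 matrix; the off-diagonal entry is the correlation
r_ab = np.corrcoef(a, b)[0, 1]
r_ac = np.corrcoef(a, c)[0, 1]
print(round(r_ab, 2), round(r_ac, 2))
```

Values near 0 suggest no linear relationship, while values near 1 or -1 suggest a strong one.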

**Pause. >>>> Let the instructor know you are at this point.**

<h1>End of Module 2: Python Data Libraries</h1>