<a href="https://colab.research.google.com/github/jeffheaton/t81_558_deep_learning/blob/master/present/blob/master/WUSTL/PythonTutorial/python_tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Introduction to Python

[Jeff Heaton](http://www.jeffheaton.com)

Python is an interpreted, high-level, general-purpose programming language. Created by Guido van Rossum and first released in 1991, Python's design philosophy emphasizes code readability with its notable use of significant whitespace. Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects.  Python has become a common language for machine learning research and is the primary language for TensorFlow. 

Like most tutorials, we will begin by printing Hello World.

In [1]:
print("Hello World")

Hello World


The above code passes a constant string, containing the text "Hello World", to a function that is named print.

You can also leave comments in your code to explain what you are doing.  Comments can begin anywhere in a line.

In [2]:
# Single line comment (this has no effect on your program)
print("Hello World") # Say hello

Hello World


Like many languages, Python uses single (') and double (") quotes interchangeably to denote literal string constants. The following code makes use of single quotes.

In [3]:
print('Hello World')

Hello World


In addition to strings, Python allows numbers as literal constants in programs. Python includes support for floating-point, integer, complex, and other types of numbers. This course will not make use of complex numbers. Unlike strings, quotes do not enclose numbers.

The presence of a decimal point differentiates floating-point and integer numbers. For example, the value 42 is an integer. Similarly, 42.5 is a floating-point number. If you wish to have a floating-point number, without a fraction part, you should specify a zero fraction. The value 42.0 is a floating-point number, although it has no fractional part. As an example, the following code prints two numbers.

In [4]:
print(42)
print(42.5)

42
42.5
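To confirm how Python classifies a literal, the built-in type function can help. The following is a small sketch; the variable names are illustrative and not part of the original notebook.

```python
# The decimal point, not the value, determines the numeric type.
whole = 42          # integer literal
fractional = 42.5   # floating-point literal
whole_float = 42.0  # floating-point, even though the fraction is zero

print(type(whole))           # <class 'int'>
print(type(fractional))      # <class 'float'>
print(type(whole_float))     # <class 'float'>
print(whole == whole_float)  # True: the values compare equal, only the types differ
```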


So far, we have only seen how to define literal numeric and string values. These literal values are constant and do not change as your program runs. Variables allow your program to hold values that can change as the program runs. Variables have names that allow you to reference their values. The following code assigns an integer value to a variable named "a" and a string value to a variable named "b."

In [5]:
a = 10
b = "ten"
print(a)
print(b)

10
ten


The key feature of variables is that they can change. The following code demonstrates how to change the values held by variables.

In [6]:
a = 10
print(a)
a = a + 1
print(a)

10
11


You can mix strings and variables for printing. This technique is called a formatted or interpolated string. The variables must be inside of the curly braces. In Python, this type of string is generally called an f-string. The f-string is denoted by placing an "f" just in front of the opening single or double quote that begins the string. The following code demonstrates the use of an f-string to mix a variable with a literal string.

In [7]:
a = 10
print(f'The value of a is {a}')

The value of a is 10


You can also use f-strings with math (called an expression). Curly braces can enclose any valid Python expression for printing. The following code demonstrates the use of an expression inside of the curly braces of an f-string.

In [8]:
a = 10
print(f'The value of a plus 5 is {a+5}')

The value of a plus 5 is 15
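An expression inside the curly braces can also be followed by a colon and a format specification, which controls how the value is displayed. The sketch below uses illustrative values (price and count are not from the original notebook).

```python
price = 3.14159   # illustrative value
count = 1234567   # illustrative value

print(f'Two decimals: {price:.2f}')    # Two decimals: 3.14
print(f'With separators: {count:,}')   # With separators: 1,234,567
print(f'Expression: {price * 2:.1f}')  # Expression: 6.3
```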


Python has many ways to print numbers; these are all correct. However, for this course, we will use f-strings. The following code demonstrates some of the varied methods of printing numbers in Python.

In [9]:
a = 5

print(f'a is {a}') # Preferred method for this course.
print('a is {}'.format(a))
print('a is ' + str(a))
print('a is %d' % (a))

a is 5
a is 5
a is 5
a is 5


You can use if-statements to perform logic. Notice the indents? Indentation is how Python defines blocks of code to execute together. A block usually begins after a colon and includes any lines at the same level of indent. Unlike many other programming languages, Python uses whitespace to define blocks of code. The fact that whitespace is significant to the meaning of program code is a frequent source of annoyance for new programmers of Python. Either tabs or spaces can indent a block, but Python 3 raises a TabError when a file mixes them inconsistently, so pick one and use it throughout. The PEP 8 style guide recommends four spaces per level.

In [10]:
a = 5
if a>5:
    print('The variable a is greater than 5.')
else:
    print('The variable a is not greater than 5')

The variable a is not greater than 5


The following if-statement has multiple levels. It can be easy to indent these levels improperly, so be careful. This code contains a nested if-statement under the first "a==5" if-statement. Only if a is equal to 5 will the nested "b==6" if-statement be executed. Also, note that the "elif" keyword means "else if."

In [11]:
a = 5
b = 6

if a==5:
    print('The variable a is 5')
    if b==6:
        print('The variable b is also 6')
elif a==6:
    print('The variable a is 6')

The variable a is 5
The variable b is also 6



It is also important to note that the double equal ("==") operator is used to test the equality of two expressions.  The single equal ("=") operator is only used to assign values to variables in Python.  The greater than (">"), less than ("<"), greater than or equal (">="), less than or equal ("<=") all perform as would generally be accepted.  Testing for inequality is performed with the not equal ("!=") operator. 

It is common in programming languages to loop over a range of numbers.  Python accomplishes this through the use of the **range** function.  Here you can see a **for** loop and a **range** call that causes the program to loop from 1 up to, but not including, 3.

In [12]:
for x in range(1, 3):  # If you ever see xrange, you are in Python 2
    print(x) 

1
2
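Because the end value of a range is excluded, it can help to see a few variations side by side. The following sketch wraps each range in list only so the values can be printed at once; the variable names are illustrative.

```python
one_to_two = list(range(1, 3))     # stops before 3
one_to_three = list(range(1, 4))   # raise the end value to include 3
zero_based = list(range(5))        # the start defaults to 0
evens = list(range(0, 10, 2))      # an optional third argument sets the step

print(one_to_two)    # [1, 2]
print(one_to_three)  # [1, 2, 3]
print(zero_based)    # [0, 1, 2, 3, 4]
print(evens)         # [0, 2, 4, 6, 8]
```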


### Lists and Tuples

For a Python program, lists and tuples are very similar. Both lists and tuples hold an ordered collection of items. It is possible to get by as a programmer using only lists and ignoring tuples.

The primary difference that you will see syntactically is that a list is enclosed by square brackets [], and a tuple is enclosed by parentheses (). The following code defines both a list and a tuple.

In [13]:
l = ['a', 'b', 'c', 'd']
t = ('a', 'b', 'c', 'd')

print(l)
print(t)

['a', 'b', 'c', 'd']
('a', 'b', 'c', 'd')


The primary difference you will see programmatically is that a list is mutable, which means the program can change it. A tuple is immutable, which means the program cannot change it. The following code demonstrates that the program can change a list. This code also illustrates that Python indexes lists starting at element 0, so assigning to index 1 modifies the second element in the collection. One advantage of tuples over lists is that tuples are generally slightly faster to iterate over than lists.

In [14]:
l[1] = 'changed'
#t[1] = 'changed' # This would result in an error

print(l)

['a', 'changed', 'c', 'd']
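To see what the commented-out tuple assignment would do, the following sketch catches the resulting error. It also shows one possible way to get a modified copy of a tuple by going through a list; the names as_list and t2 are illustrative.

```python
t = ('a', 'b', 'c', 'd')

try:
    t[1] = 'changed'  # tuples do not support item assignment
except TypeError as err:
    print(f'TypeError: {err}')

# A tuple cannot be changed in place, but a new tuple can be built from a list.
as_list = list(t)
as_list[1] = 'changed'
t2 = tuple(as_list)

print(t)   # ('a', 'b', 'c', 'd')  (the original is untouched)
print(t2)  # ('a', 'changed', 'c', 'd')
```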


Like many languages, Python has a for-each statement.  This statement allows you to loop over every element in a collection, such as a list or a tuple.

In [15]:
# Iterate over a collection.
for s in l:
    print(s)

a
changed
c
d


The **enumerate** function is useful for enumerating over a collection and having access to the index of the element that we are currently on.

In [16]:
# Iterate over a collection and know the index of each item.  (Python is zero-based!)
# The loop variable is named item so that it does not overwrite the list l.
for i, item in enumerate(l):
    print(f"{i}:{item}")

0:a
1:changed
2:c
3:d
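If a one-based count is more convenient, for example when numbering items for display, enumerate accepts an optional start argument. A brief sketch follows, using a separate list named items (an illustrative name) with the same contents as above.

```python
items = ['a', 'changed', 'c', 'd']

# start=1 changes only the counter; the list itself is still zero-based.
for position, item in enumerate(items, start=1):
    print(f"{position}:{item}")

numbered = list(enumerate(items, start=1))
print(numbered[0])  # (1, 'a')
```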


A **list** can have multiple objects added, such as strings.  Duplicate values are allowed.  **Tuples** do not allow the program to add additional objects after definition.

In [17]:
# Manually add items, lists allow duplicates
c = []
c.append('a')
c.append('b')
c.append('c')
c.append('c')
print(c)

['a', 'b', 'c', 'c']


Ordered collections, such as lists and tuples, allow you to access an element by its index number, as done in the following code. Sets, which are unordered, and dictionaries, which are accessed by key, do not allow the program to access them by position in this way.

In [18]:
print(c[1])

b


A **list** also allows the program to insert and remove items. The insert function requires an index that specifies where to place the new item. The remove function deletes the first item that matches a value, and the del statement deletes the item at a given index. These operations are not allowed for tuples because they would result in a change.

In [19]:
# Insert
c = ['a', 'b', 'c']
c.insert(0, 'a0')
print(c)
# Remove
c.remove('b')
print(c)
# Remove at index
del c[0]
print(c)

['a0', 'a', 'b', 'c']
['a0', 'a', 'c']
['a', 'c']


### Sets
A Python **set** holds an unordered collection of objects, but sets do *not* allow duplicates.  If a program adds a duplicate item to a set, only one copy of each item remains in the collection.  Adding a duplicate item to a set does not result in an error.   Any of the following techniques will define a set.

In [20]:
s = set()
s = { 'a', 'b', 'c'}
s = set(['a', 'b', 'c'])
print(s)

{'a', 'c', 'b'}


A **list** is always enclosed in square brackets [], a **tuple** in parentheses (), and similarly a **set** is enclosed in curly braces {}.  Programs can dynamically add items to a **set** as they run with the **add** function.  It is important to note that the **append** function adds items to lists, whereas the **add** function adds items to a **set**.  

In [21]:
# Manually add items, sets do not allow duplicates
# Sets add, lists append.  I find this annoying.
c = set()
c.add('a')
c.add('b')
c.add('c')
c.add('c')
print(c)

{'a', 'c', 'b'}
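One common use of this behavior is removing duplicates from a list. The sketch below uses illustrative names; it also shows that empty curly braces create a dictionary rather than a set, which is why an empty set is written as set().

```python
letters = ['a', 'b', 'c', 'c', 'a']
unique = set(letters)  # duplicates collapse to one copy each

print(len(letters))    # 5
print(len(unique))     # 3
print('a' in unique)   # True
print(sorted(unique))  # ['a', 'b', 'c']  (sorted to get a stable order)

empty_set = set()      # an empty set
empty_braces = {}      # an empty dictionary, not a set
print(type(empty_braces))  # <class 'dict'>
```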


### Maps/Dictionaries/Hash Tables

Many programming languages include the concept of a map, dictionary, or hash table.  These are all closely related concepts.  Python provides a dictionary that is essentially a collection of name-value pairs.  Programs define dictionaries using curly braces, as seen here.

In [22]:
d = {'name': "Jeff", 'address':"123 Main"}
print(d)
print(d['name'])

if 'name' in d:
    print("Name is defined")

if 'age' in d:
    print("age defined")
else:
    print("age undefined")

{'name': 'Jeff', 'address': '123 Main'}
Jeff
Name is defined
age undefined


Be careful that you do not attempt to access an undefined key with square brackets, as this will result in a KeyError. You can check to see if a key is defined, as demonstrated above. You can also access the dictionary and provide a default value, as the following code demonstrates.

In [23]:
d.get('unknown_key', 'default')

'default'
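The difference between the two access styles can be sketched as follows. The dictionary is redefined here so that the example stands alone, and the names age and age_default are illustrative.

```python
d = {'name': "Jeff", 'address': "123 Main"}

try:
    print(d['age'])  # 'age' is not a key, so this raises KeyError
except KeyError as err:
    print(f"Missing key: {err}")  # Missing key: 'age'

age = d.get('age')                     # no default given, so the result is None
age_default = d.get('age', 'unknown')  # the default is returned instead
print(age)          # None
print(age_default)  # unknown
```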

You can also access the individual keys and values of a dictionary.

In [24]:
d = {'name': "Jeff", 'address':"123 Main"}
# All of the keys
print(f"Key: {d.keys()}")

# All of the values
print(f"Values: {d.values()}")

Key: dict_keys(['name', 'address'])
Values: dict_values(['Jeff', '123 Main'])
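When both the key and the value are needed together, the items function returns them as pairs that a for loop can unpack. A minimal sketch, with the dictionary repeated so the example is self-contained:

```python
d = {'name': "Jeff", 'address': "123 Main"}

# items() yields (key, value) pairs in insertion order (Python 3.7 and later).
for key, value in d.items():
    print(f"{key} -> {value}")

# list() converts the key and value views into ordinary lists.
keys = list(d.keys())
values = list(d.values())
print(keys)    # ['name', 'address']
print(values)  # ['Jeff', '123 Main']
```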


Dictionaries and lists can be combined. This syntax is closely related to [JSON](https://en.wikipedia.org/wiki/JSON).  Dictionaries and lists together are a good way to build very complex data structures.  While Python allows quotes (") and apostrophe (') for strings, JSON only allows double-quotes ("). We will cover JSON in much greater detail later in this tutorial.

The following code shows a hybrid usage of dictionaries and lists.  

In [25]:
# Python list & map structures
customers = [
    {"name": "Jeff & Tracy Heaton", "pets": ["Wynton", "Cricket", 
        "Hickory"]},
    {"name": "John Smith", "pets": ["rover"]},
    {"name": "Jane Doe"}
]

print(customers)

for customer in customers:
    print(f"{customer['name']}:{customer.get('pets', 'no pets')}")

[{'name': 'Jeff & Tracy Heaton', 'pets': ['Wynton', 'Cricket', 'Hickory']}, {'name': 'John Smith', 'pets': ['rover']}, {'name': 'Jane Doe'}]
Jeff & Tracy Heaton:['Wynton', 'Cricket', 'Hickory']
John Smith:['rover']
Jane Doe:no pets


The variable customers is a list that holds three dictionaries that represent customers. You can think of these dictionaries as records in a table. The fields in these individual records are the keys of the dictionary. Here the keys name and pets are fields. However, the field pets holds a list of pet names. There is no limit to how deep you might choose to nest lists and maps. It is also possible to nest a map inside of a map or a list inside of another list.
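Reaching into a nested structure is a matter of chaining index and key lookups from the outside in. The sketch below repeats the customers list so that it runs on its own; second_pet and pet_counts are illustrative names.

```python
customers = [
    {"name": "Jeff & Tracy Heaton", "pets": ["Wynton", "Cricket", "Hickory"]},
    {"name": "John Smith", "pets": ["rover"]},
    {"name": "Jane Doe"}
]

# list index -> dictionary key -> list index
second_pet = customers[0]["pets"][1]
print(second_pet)  # Cricket

# get() with an empty-list default handles customers that have no "pets" key.
pet_counts = {c["name"]: len(c.get("pets", [])) for c in customers}
print(pet_counts)
```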

### An Introduction to JSON

Data stored in a CSV file must be flat; it must fit into rows and columns. Most people refer to this type of data as structured or tabular. This data is tabular because the number of columns is the same for every row. Individual rows may be missing a value for a column; however, these rows still have the same columns.  

This data is convenient for machine learning because most models, such as neural networks, also expect incoming data to be of fixed dimensions. Real-world information is not always so tabular. Consider if the rows represent customers. These people might have multiple phone numbers and addresses. How would you describe such data using a fixed number of columns? It would be useful to have a list of these phone numbers and addresses in each row, where the list can be a different length for each customer.

JavaScript Object Notation (JSON) is a standard file format that stores data in a hierarchical format similar to eXtensible Markup Language (XML). JSON is nothing more than a hierarchy of lists and dictionaries. Programmers refer to this sort of data as semi-structured data or hierarchical data. The following is a sample JSON file.

```
{
  "firstName": "John",
  "lastName": "Smith",
  "isAlive": true,
  "age": 27,
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "New York",
    "state": "NY",
    "postalCode": "10021-3100"
  },
  "phoneNumbers": [
    {
      "type": "home",
      "number": "212 555-1234"
    },
    {
      "type": "office",
      "number": "646 555-4567"
    },
    {
      "type": "mobile",
      "number": "123 456-7890"
    }
  ],
  "children": [],
  "spouse": null
}
```

The above file may look somewhat like Python code.  You can see curly braces that define dictionaries and square brackets that define lists.  JSON does require there to be a single root element.  A list or dictionary can fulfill this role.  JSON requires double-quotes to enclose strings and names.  Single quotes are not allowed in JSON.

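JSON files are always legal JavaScript syntax.  JSON is also nearly valid Python code.  Once the JSON literals true, false, and null are replaced with the Python values True, False, and None, the sample above runs as the following Python program.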

In [26]:
jsonHardCoded = {
  "firstName": "John",
  "lastName": "Smith",
  "isAlive": True,
  "age": 27,
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "New York",
    "state": "NY",
    "postalCode": "10021-3100"
  },
  "phoneNumbers": [
    {
      "type": "home",
      "number": "212 555-1234"
    },
    {
      "type": "office",
      "number": "646 555-4567"
    },
    {
      "type": "mobile",
      "number": "123 456-7890"
    }
  ],
  "children": [],
  "spouse": None
}

Generally, it is better to read JSON from files, strings, or the Internet than hard coding, as demonstrated here.  However, for internal data structures, sometimes such hard-coding can be useful.

Python contains support for JSON.  When a Python program loads JSON, the root list or dictionary is returned, as demonstrated by the following code.

In [27]:
import json

json_string = '{"first":"Jeff","last":"Heaton"}'
obj = json.loads(json_string)
print(f"First name: {obj['first']}")
print(f"Last name: {obj['last']}")

First name: Jeff
Last name: Heaton


Python programs can also load JSON from a file or URL.

In [28]:
import requests

r = requests.get("https://raw.githubusercontent.com/jeffheaton/"
                 +"t81_558_deep_learning/master/person.json")
print(r.json())

{'firstName': 'John', 'lastName': 'Smith', 'isAlive': True, 'age': 27, 'address': {'streetAddress': '21 2nd Street', 'city': 'New York', 'state': 'NY', 'postalCode': '10021-3100'}, 'phoneNumbers': [{'type': 'home', 'number': '212 555-1234'}, {'type': 'office', 'number': '646 555-4567'}, {'type': 'mobile', 'number': '123 456-7890'}], 'children': [], 'spouse': None}


Python programs can easily generate JSON strings from Python dictionaries and lists.

In [29]:
python_obj = {"first":"Jeff","last":"Heaton"}
print(json.dumps(python_obj))

{"first": "Jeff", "last": "Heaton"}
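For larger structures, the indent parameter of json.dumps produces a more readable, multi-line string. The sketch below also checks that the generated string loads back into an equal Python object; the pets field is added here only to illustrate nesting.

```python
import json

python_obj = {"first": "Jeff", "last": "Heaton", "pets": ["Wynton", "Cricket"]}

# indent=2 pretty-prints the output with two spaces per nesting level.
pretty = json.dumps(python_obj, indent=2)
print(pretty)

# Loading the string back should reproduce the original structure.
round_trip = json.loads(pretty)
print(round_trip == python_obj)  # True
```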


A data scientist will generally encounter JSON when they access web services to get their data. A data scientist might use the techniques presented in this section to convert the semi-structured JSON data into tabular data for the program to use with a model such as a neural network.
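One possible way to make that conversion, sketched here in plain Python, is to emit one flat row per nested item and repeat the parent fields in each row. The column names phoneType and phoneNumber are assumptions chosen for illustration, and this is only one of several reasonable flattening strategies.

```python
person = {
    "firstName": "John",
    "lastName": "Smith",
    "phoneNumbers": [
        {"type": "home", "number": "212 555-1234"},
        {"type": "office", "number": "646 555-4567"},
        {"type": "mobile", "number": "123 456-7890"},
    ],
}

# One row per phone number; every row has the same four columns.
rows = [
    {
        "firstName": person["firstName"],
        "lastName": person["lastName"],
        "phoneType": phone["type"],
        "phoneNumber": phone["number"],
    }
    for phone in person["phoneNumbers"]
]

for row in rows:
    print(row)
```

Each resulting row has the same set of keys, so the list could be written to a CSV file or passed to a tabular library.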

### Reading Files
Python programs can read CSV files with Pandas. A typical call that loads a CSV file into a Pandas DataFrame looks like this:


In [30]:
import pandas as pd

df = pd.read_csv("https://data.heatonresearch.com/data/t81-558/auto-mpg.csv",
                 na_values=['NA', '?'])

The above command loads a classic dataset of cars from the Internet.  It might take a few seconds to load, so it is good to keep the loading code in a separate Jupyter notebook cell so that you do not have to reload it as you test your program.  You can load data from the Internet, a local hard drive, or Google Drive this way.
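The na_values argument tells Pandas which strings to treat as missing values. The following sketch illustrates the effect on a tiny, made-up CSV held in memory; the three rows and the car names are hypothetical and are not taken from the Auto MPG file.

```python
import io
import pandas as pd

# Hypothetical sample; '?' and 'NA' stand in for missing horsepower values.
csv_text = """mpg,horsepower,name
18.0,130,car a
25.0,?,car b
21.0,NA,car c
"""

sample = pd.read_csv(io.StringIO(csv_text), na_values=['NA', '?'])

print(sample)
missing = int(sample['horsepower'].isna().sum())
print(missing)                     # 2
print(sample['horsepower'].dtype)  # float64, since the missing markers became NaN
```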

Now that the data is loaded, you can display the first five rows with this command.

In [31]:
display(df[0:5])

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name
0,18.0,8,307.0,130.0,3504,12.0,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,165.0,3693,11.5,70,1,buick skylark 320
2,18.0,8,318.0,150.0,3436,11.0,70,1,plymouth satellite
3,16.0,8,304.0,150.0,3433,12.0,70,1,amc rebel sst
4,17.0,8,302.0,140.0,3449,10.5,70,1,ford torino


We can query this dataset. To test which cars have an MPG of at least 18, you can use this command.

In [32]:
df.mpg>=18

0       True
1      False
2       True
3      False
4      False
       ...  
393     True
394     True
395     True
396     True
397     True
Name: mpg, Length: 398, dtype: bool

Notice that the result is a series of true/false values equal in length to the dataset. Each row that passes the query has a value of true. To see the matching cars themselves, use the true/false series to index the dataset. The following command shows the cars with an MPG of at least 23.

In [33]:
df[df.mpg>=23]

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name
14,24.0,4,113.0,95.0,2372,15.0,70,3,toyota corona mark ii
18,27.0,4,97.0,88.0,2130,14.5,70,3,datsun pl510
19,26.0,4,97.0,46.0,1835,20.5,70,2,volkswagen 1131 deluxe sedan
20,25.0,4,110.0,87.0,2672,17.5,70,2,peugeot 504
21,24.0,4,107.0,90.0,2430,14.5,70,2,audi 100 ls
...,...,...,...,...,...,...,...,...,...
393,27.0,4,140.0,86.0,2790,15.6,82,1,ford mustang gl
394,44.0,4,97.0,52.0,2130,24.6,82,2,vw pickup
395,32.0,4,135.0,84.0,2295,11.6,82,1,dodge rampage
396,28.0,4,120.0,79.0,2625,18.6,82,1,ford ranger


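Parentheses become very important if you wish to have two parts to the query. You must use the bitwise operator & for "and" rather than the Python keyword and. You must also place parentheses around each of the two comparisons, because & binds more tightly than the comparison operators.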

In [34]:
df[ (df.mpg>=23) & (df.mpg<=24) ]

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name
14,24.0,4,113.0,95.0,2372,15.0,70,3,toyota corona mark ii
21,24.0,4,107.0,90.0,2430,14.5,70,2,audi 100 ls
49,23.0,4,122.0,86.0,2220,14.0,71,1,mercury capri 2000
57,24.0,4,113.0,95.0,2278,15.5,72,3,toyota corona hardtop
59,23.0,4,97.0,54.0,2254,23.5,72,2,volkswagen type 3
82,23.0,4,120.0,97.0,2506,14.5,72,3,toyouta corona mark ii (sw)
101,23.0,6,198.0,95.0,2904,16.0,73,1,plymouth duster
118,24.0,4,116.0,75.0,2158,15.5,73,2,opel manta
122,24.0,4,121.0,110.0,2660,14.0,73,2,saab 99le
147,24.0,4,90.0,75.0,2108,15.5,74,2,fiat 128


# Transferring Files

At some point, you may wish to download generated data or upload your data to CoLab. There are several ways to accomplish this transfer:

* Map Google Drive to CoLab
* Generate your Data and Upload/Download

### Mapping Google Drive

The following code maps your Google Drive to CoLab. This operation is performed only for this session of CoLab. You must rerun this code when you leave CoLab and return to it.

In [35]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


You can see the contents of your drive with the !ls command. You can run any UNIX command with the ! prefix. The UNIX ls command lists a directory.

In [36]:
!ls /content/drive/MyDrive

 bin  'Colab Notebooks'   mergelife   share     test.csv
 cfg   data		  projects    t81_558   yolo


We could use the following code to query an MPG range and save those cars to a CSV file on our Google Drive.

In [37]:
df2 = df[ (df.mpg>=23) & (df.mpg<=24) ]
df2.to_csv("/content/drive/MyDrive/test.csv")

Similarly, we could load a CSV from the Google Drive with the following command.

In [38]:
df3 = pd.read_csv("/content/drive/MyDrive/test.csv",
                 na_values=['NA', '?'])
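By default, to_csv also writes the DataFrame index as an unnamed first column, which read_csv then loads as a column called "Unnamed: 0". If that column is not wanted, index=False leaves it out. The sketch below demonstrates the difference with a hypothetical two-row DataFrame and a temporary directory rather than Google Drive.

```python
import os
import tempfile
import pandas as pd

# Hypothetical data used only to compare the two ways of saving.
cars = pd.DataFrame({"name": ["car a", "car b"], "mpg": [23.0, 24.0]})

with tempfile.TemporaryDirectory() as tmp:
    with_index = os.path.join(tmp, "with_index.csv")
    without_index = os.path.join(tmp, "without_index.csv")

    cars.to_csv(with_index)                  # default: the index becomes the first column
    cars.to_csv(without_index, index=False)  # the index is not written

    cols_with = list(pd.read_csv(with_index).columns)
    cols_without = list(pd.read_csv(without_index).columns)

print(cols_with)     # ['Unnamed: 0', 'name', 'mpg']
print(cols_without)  # ['name', 'mpg']
```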

### Uploading and Downloading Files

Your Google CoLab instance contains a temporary drive. Anything you store here will be deleted once you stop using CoLab. However, it can be a great temporary area to upload and download to/from. Your temporary drive is mapped to /content. 

You can use the UNIX wget command to download a file from a URL. The following code downloads the Auto MPG dataset to your temporary drive.

In [39]:
!wget https://data.heatonresearch.com/data/t81-558/auto-mpg.csv

--2022-09-27 14:45:21--  https://data.heatonresearch.com/data/t81-558/auto-mpg.csv
Resolving data.heatonresearch.com (data.heatonresearch.com)... 108.156.83.38, 108.156.83.80, 108.156.83.90, ...
Connecting to data.heatonresearch.com (data.heatonresearch.com)|108.156.83.38|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 18121 (18K) [text/csv]
Saving to: ‘auto-mpg.csv.1’


2022-09-27 14:45:21 (4.93 MB/s) - ‘auto-mpg.csv.1’ saved [18121/18121]



The following command lists your content area, showing you the file you just downloaded. Because a file named auto-mpg.csv already existed, wget saved the new copy as auto-mpg.csv.1.

In [40]:
!ls /content

auto-mpg.csv  auto-mpg.csv.1  drive  sample_data


We can generate a query and save the CSV results to our temporary drive.

In [41]:
df2 = df[ (df.mpg>=23) & (df.mpg<=24) ]
df2.to_csv("test.csv")

Listing the files again shows the file we just generated.

In [42]:
!ls /content

auto-mpg.csv  auto-mpg.csv.1  drive  sample_data  test.csv


We can download files directly from the temporary drive.

In [43]:
from google.colab import files
files.download("/content/test.csv")


We can also upload files to the temporary drive.

In [44]:
from google.colab import files

# files.upload() opens a file picker and returns a dictionary
# that maps each uploaded filename to its contents.
uploaded = files.upload()

for name in uploaded:
    print(f"You uploaded /content/{name}")

Saving iris.csv to iris.csv
You uploaded /content/iris.csv


Listing the /content area shows us what we just uploaded.

In [45]:
!ls /content

auto-mpg.csv  auto-mpg.csv.1  drive  iris.csv  sample_data  test.csv
