# MSDS 631
## Lecture 1 (January 23, 2019) - Data Types, Data Structures, and Arithmetic Operators
---

Everybody's first code is <em>always</em> "Hello World"

In [3]:
print('Hello World')

Hello World


## Python uses many of the same basic arithmetic operators as Excel

Addition uses a basic plus sign

In [4]:
1+1

2

Multiplication uses an asterisk

In [1]:
7*3

21

Division uses a single forward slash

In [6]:
7/3

2.3333333333333335

Integer division uses two forward slashes. Use integer division when you want to know how many times the divisor goes fully into the dividend. In the example below, the number 3 goes into the number 7 twice fully.

In [7]:
7//3

2

The modulo operator gives the remainder when dividing the dividend by the divisor. In the example below, 3 goes into the number 7 twice, with a remainder of 1.

In [9]:
7%3

1
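
Integer division and modulo are two halves of the same calculation: the quotient times the divisor, plus the remainder, gives back the dividend. The sketch below checks that identity for the 7 and 3 example and uses the built-in `divmod` function to get both results at once. The variable names are only illustrative.

```python
dividend = 7
divisor = 3

quotient = dividend // divisor    # how many times 3 goes fully into 7
remainder = dividend % divisor    # what is left over

# The two results always recombine into the original number
print(quotient * divisor + remainder == dividend)

# divmod returns the quotient and the remainder together as a pair
pair = divmod(dividend, divisor)
print(pair)

# With a negative dividend, // rounds down (toward negative infinity),
# and the remainder takes the sign of the divisor
negative_quotient = -7 // 3
negative_remainder = -7 % 3
print(negative_quotient, negative_remainder)
```

The last two lines are worth remembering: `//` is "floor" division, so it does not simply chop off the decimal part when negative numbers are involved.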

Sometimes we don't want to run certain sections of code. When that happens, you will want to "comment out" those lines. Python ignores everything on a line after a pound sign (also known as a hashtag or number sign). To comment out multiple lines, highlight the lines you want to comment out and use the shortcut for the editor you are using, which is generally CMD-/ on Mac or CTRL-/ on Windows.

In [3]:
#I don't want Python to recognize this line
print('Hello World') #The comment does not have to happen at the beginning of a row
#It can happen anywhere. Python will ignore everything in a line after the pound sign e.g. x = 1+1

Hello World


Exponents are <strong>not</strong> represented by the "caret" symbol (^). In Python, exponents are represented by two asterisks.

In [4]:
3**3

27
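
The caret is still a valid Python operator, which is why using it by mistake does not raise an error. It performs a bitwise XOR on integers instead of raising to a power. A short comparison, with illustrative variable names:

```python
power_result = 3 ** 3      # exponentiation: 3 * 3 * 3
pow_result = pow(3, 3)     # the built-in pow function does the same thing
caret_result = 3 ^ 3       # bitwise XOR, NOT an exponent

print(power_result)
print(pow_result)
print(caret_result)
```

Because `3 ^ 3` quietly returns a number (zero in this case), this mistake can be hard to spot in a longer formula.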

In [5]:
# x = 1
# y = 2
# z = 3
print('Hello World')

Hello World


In [7]:
# Trying to print x will trigger an error message because Python was never told what "x" was supposed to be.
print(x)

NameError: name 'x' is not defined

Python uses the same order of operations as in basic arithmetic. The basic order of precedence is:
1. Parentheses
2. Exponentiation
3. Multiplication, Division, Floor division, Modulus
4. Addition and Subtraction

In [18]:
1 + 2 * 3

7

In [8]:
(1 + 2) * 3

9

In [9]:
1 + 2 * 3 ** 2

19

In [10]:
(1 + 2) * (3 ** 2)

27
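
Two details of the precedence rules tend to surprise people: chained exponents are evaluated from right to left, and a leading minus sign is applied after the exponent. Operators that share a level (such as division and multiplication) are evaluated left to right. The sketch below shows each case; the variable names are only for illustration.

```python
chained = 2 ** 3 ** 2          # right to left: 2 ** (3 ** 2)
grouped = (2 ** 3) ** 2        # parentheses force the left exponent first

negated = -2 ** 2              # exponent first, then the minus: -(2 ** 2)
negative_base = (-2) ** 2      # parentheses make the base negative

left_to_right = 8 / 4 * 2      # same level, so (8 / 4) * 2

print(chained, grouped)
print(negated, negative_base)
print(left_to_right)
```

When in doubt, add parentheses. They cost nothing and make the intent obvious to the next reader.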

## Variables
Variables are a way to store values or procedures so that your code is easier to implement and read.

In [63]:
x = 123
y = 321
z = x / y

In [65]:
# Note that when we assign the values to a variable, Python does not send anything to the console
# to print (like we saw in the arithmetic above). In order to see the values assigned to the variables
# we need to explicitly display them with a "print" command.
print(x)
print(y)
print(z)

123
321
0.38317757009345793


Variables can have almost any structure as long as they adhere to the following rules:
- They must not begin with a number (numbers can appear anywhere else though)
- They must not contain any of the following characters:
 - Spaces
 - Special characters, including almost all punctuation. The one exception is the underscore (e.g. my_variable)
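- They must not conflict with any of Python's "reserved" keywords, such as `if`, `for`, and `with`. Names of built-in functions such as `print`, `id`, and `open` are not reserved in Python 3, but reusing them as variable names hides the built-in function, so avoid them as well. A list of keywords (for an older version of Python) can be found here: https://docs.python.org/2.5/ref/keywords.html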
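
You can ask Python itself whether a name follows these rules. The sketch below assumes Python 3 and uses two standard tools: the string method `isidentifier` (checks the shape of the name) and `keyword.iskeyword` (checks the reserved list). The candidate names are made up for illustration.

```python
import keyword

candidates = ['my_variable', '2fast', 'two words', 'with', 'print']

# For each name, record (has a valid shape?, is a reserved keyword?)
results = {}
for name in candidates:
    results[name] = (name.isidentifier(), keyword.iskeyword(name))
    print(name, results[name])
```

A name is usable only when the first value is true and the second is false. Note that `print` passes both checks in Python 3 because it is a built-in function rather than a keyword, but assigning to it would still hide the function.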

We try to adhere to several rules when choosing variable names:
- You should balance descriptiveness with efficiency. Too short usually means too vague and too long is hard to read
- Use only lowercase letters. All CAPS should only be used for defining constants and global variables. More on this later.
- You should separate words of your variable name with underscores. There is another version called "Camel Case" (e.g. `myVariable` or `yourFirstVariable`) where you use occasional uppercase lettering to delineate words, but this is not the Python standard.

Lastly, we should be pragmatic about what actually *needs* a variable assignment. We could break any formula into every step possible and assign variables at every point along the way, but this can be cumbersome and time consuming. It also makes your code difficult to read. At the other extreme, you don't want to do all of your work in a single line of code. A few well-chosen "intermediate variables" are neither inputs nor outputs that you really need, but they help you keep track of things along the way.

In [90]:
# Let's assume we have the following variables and we want to compute the discretionary budget for this household
household_income = 100000
tax_rate = .08
fixed_expense_rate = .30 #This is a percentage of post-tax income spent on necessities
fixed_savings_rate = .25 #This is a percentage of post-tax AND post-expense money set aside for savings

In [91]:
# Here we can compute the discretionary spending budget in a single line
discretionary_spending1 = (household_income * (1 - tax_rate)) * (1 - fixed_expense_rate) * (1 - fixed_savings_rate)
print(discretionary_spending1)

48299.99999999999


In [92]:
# The above formula is confusing, and it is hard to tell whether we computed things properly.
# Below we will use intermediate variables to achieve the same result
post_tax_income = household_income * (1 - tax_rate)
post_tax_and_expense_income = post_tax_income * (1 - fixed_expense_rate)
discretionary_spending2 = post_tax_and_expense_income * (1 - fixed_savings_rate)
print(discretionary_spending2)

48299.99999999999


In [93]:
# Here is another way to compute the exact same thing using different intermediate variables
taxes = household_income * tax_rate
expenses = (household_income - taxes) * fixed_expense_rate
savings = (household_income - taxes - expenses) * fixed_savings_rate
discretionary_spending3 = household_income - taxes - expenses - savings
print(discretionary_spending3)

48300.0


Do you see the floating point error above?

In [100]:
print(discretionary_spending1)
print(discretionary_spending2)
print(discretionary_spending3)
discretionary_spending1 == discretionary_spending3

48299.99999999999
48299.99999999999
48300.0


False

In [97]:
discretionary_spending2 == discretionary_spending3

False

In [98]:
discretionary_spending1 == discretionary_spending2

True

Depending on how you are using numbers, this could be an issue. For example, if you are managing a bank ledger and have billions of transactions per day, then these will absolutely add up over time. In the vast majority of use cases, however, this will not be a big deal. Simply be aware that this is sometimes an issue.
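
One common way to live with this is to stop asking whether two floats are *exactly* equal and ask whether they are *close enough*. The sketch below repeats the budget calculation two ways and compares the results with `math.isclose` from the standard library. The names `one_line` and `step_by_step` are illustrative, and the default tolerance of `math.isclose` is assumed to be acceptable here.

```python
import math

household_income = 100000
tax_rate = .08
fixed_expense_rate = .30
fixed_savings_rate = .25

# Multiply the "what is left" fractions together
one_line = (household_income * (1 - tax_rate)
            * (1 - fixed_expense_rate) * (1 - fixed_savings_rate))

# Subtract each amount in turn
taxes = household_income * tax_rate
expenses = (household_income - taxes) * fixed_expense_rate
savings = (household_income - taxes - expenses) * fixed_savings_rate
step_by_step = household_income - taxes - expenses - savings

# Exact equality can fail because of tiny rounding differences
print(one_line == step_by_step)

# isclose treats values that differ only by rounding noise as equal
print(math.isclose(one_line, step_by_step))
```

For money in particular, another common approach is to work in whole cents (integers) so that no rounding happens at all.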

## Data Types
There are three primary data types:
- Integers
- Floats
- Strings

There are a few other data types that we will deal with in the future (e.g. Booleans and NoneType) and a few that we will not cover in this class (e.g. complex numbers).

There will be key distinctions that we will need to make between integers and floats. First, integers obviously have no decimal component. Even so, in the eyes of Python there is a difference between 3 and 3.0. In order to perform mathematical operations, Python first ensures that all data is of the same type. Therefore, when trying to divide 9 by 3.3, Python will automatically "cast" 9 as 9.0 to perform the operation. In this case, Python determined that the "lowest common denominator" data type was float.<br>

When trying to guess what the "best" common data type should be, think about the following example. If I have the integer 9 and the float 3.3, it is "more correct" to convert 9 to 9.0 as a float than it is to convert 3.3 into 3 as an integer.<br>

As we move to discussing Pandas, the data types will become more important.

In [28]:
#You can find out the type of a value by calling the "type" function
print(type(3.0))
print(type(3))
print(type('abc'))

<class 'float'>
<class 'int'>
<class 'str'>


In [23]:
#Regular division (/) always returns a float, even when both values are integers
#Integer division (//) returns a float if at least one value in the equation is a float
#Integer division (//) returns an integer if both values start out as integers
print(9 / 3.0)
print(9 / 3)
print(9.0 // 3)
print(9 // 3)

3.0
3.0
3.0
3
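
Python only casts automatically in the "safe" direction (integer to float). When you want a different conversion you have to ask for it with the built-in `int`, `float`, and `str` functions. A short sketch with illustrative variable names:

```python
as_float = float(9)          # integer to float
truncated = int(3.9)         # int() drops the decimal part, it does not round
rounded = round(3.9)         # round() goes to the nearest whole number
mixed = 9 + 3.3              # Python casts 9 to 9.0 on its own here

number_from_text = int('42') + 1    # a string of digits can become a number
text_from_number = str(42) + '1'    # and a number can become a string

print(as_float, truncated, rounded)
print(type(mixed))
print(number_from_text, text_from_number)
```

Note the difference in the last two lines: once a value is a string, the plus sign joins text together instead of adding numbers.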


It's important to note, however, that floats can sometimes behave very oddly in Python. Computers work in binary formats, and that sometimes means representing numbers as approximations (albeit very very very close approximations). In the _<strong>vast</strong>_ majority of cases you will never see this difference. Only in extreme situations will you run into issues like we see below.

In [24]:
#In this case, 3.675 is being rounded DOWN instead of up even though the number being rounded is a "5"
round(3.675, 2)

3.67

In [26]:
#In this case, adding two float values results in an imprecise outcome
print(.1+.2)
print(.1+.2==.3) # Checking to see if 0.1 + 0.2 equals 0.3

0.30000000000000004
False
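
If you ever need decimal arithmetic that behaves the way it does on paper, the standard library's `decimal` module is one option. The sketch below first shows the value the computer actually stored for 3.675 (which explains the surprising rounding), then redoes both examples with `Decimal` values built from strings. Whether the extra precision is worth the slower speed is a judgment call for your own project.

```python
from decimal import Decimal, ROUND_HALF_UP

# Building a Decimal from a float reveals the binary approximation
stored = Decimal(3.675)
print(stored)

# Building a Decimal from a string keeps the number exactly as typed
exact = Decimal('3.675')
half_up = exact.quantize(Decimal('0.01'), rounding=ROUND_HALF_UP)
print(half_up)

# The 0.1 + 0.2 comparison now behaves as expected
decimal_sum_matches = Decimal('0.1') + Decimal('0.2') == Decimal('0.3')
print(decimal_sum_matches)
```

The stored float is slightly *below* 3.675, so rounding it down to 3.67 is actually the correct answer for the number the computer has.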


Strings are exactly as you would assume they are - typical characters and numbers associated with typing. In the original `print("Hello World")` command you wrote, "Hello World" is considered a string value. You can use single quotes or double quotes in Python to define a string, but other languages do not allow you to interchange them so liberally. This is a matter of personal preference. For instance, you will occasionally want to store a string that contains an apostrophe. In this case, it is most convenient to just choose double quotes to encapsulate your string and use your apostrophe inside of that string. (e.g. "This is Jason's Computer")

In [32]:
print('Hello' == "Hello") #Checking to see if defining "Hello" with single or double quotes makes a difference
print("This is Jason's Computer")

True
This is Jason's Computer


In [33]:
# You can also add strings together, but be sure to account for spaces since they are not automatically added.
"Hello" + "World"

'HelloWorld'

In [34]:
# You can also multiply letters and they will repeat
'A' * 4

'AAAA'

In [36]:
# You cannot subtract or divide letters though
'AAAA' - 'AAA' #Does not equal 'A'

TypeError: unsupported operand type(s) for -: 'str' and 'str'

Everything in Python is case sensitive, so be mindful when typing. You will almost certainly run into a case issue at some point.

In [138]:
'Jason' == 'JAson'

False
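
When capitalization should *not* matter (for example, when comparing names typed in by different people), a common approach is to convert both strings to the same case before comparing. A minimal sketch using the built-in string methods `lower` and `upper`, with illustrative variable names:

```python
name_a = 'Jason'
name_b = 'JAson'

exact_match = name_a == name_b                   # case sensitive
relaxed_match = name_a.lower() == name_b.lower() # case insensitive

print(exact_match)
print(relaxed_match)
print(name_b.upper())
```

These methods return new strings and leave the original values unchanged.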

In [139]:
# In many cases you will want to write multiple lines of text for a single string value.
# In this case, you should use three quotes to set the opening and closing of your multi-line text.
# The quotes used to open and close the multi-line string can be single or double quotes. Note that when you use the
# triple quotes, you are allowed to use any quotes or apostrophes and it won't try to "close" the string.
"""Hi, my name is Jason. I work in a button factory. One day, my boss came to me and said, "Jason, 
   are you busy?" I said "no." 
   THE END"""

'Hi, my name is Jason. I work in a button factory. One day, my boss came to me and said, "Jason, \n   are you busy?" I said "no." \n   THE END'

In the text resulting from my string, you'll see several strange characters (namely the \n). These are special characters that represent new lines (aka line returns). They show up when Python displays the underlying value of a string, as it does for the last expression in a cell. If I were to print this same string, the newline character would be shown as an actual line break. We will talk about this distinction later in the class.

In [60]:
phrase = """Hi, my name is Jason. I work in a button factory. One day, my boss came to me and said, "Jason, 
are you busy?" I said "no." 
THE END"""
print(phrase)

Hi, my name is Jason. I work in a button factory. One day, my boss came to me and said, "Jason, 
are you busy?" I said "no." 
THE END


Python also has many ways to auto-insert values into strings to make them more dynamic. The easiest way to do this is via string "formatting." There are a few ways to format strings.

In [61]:
#Ordered fill values. Note that I used double quotes because I had an apostrophe in the sentence
"Hello my name is {}. My mother's name is {}".format('Jason', 'Margaret') #Values are separated by commas

"Hello my name is Jason. My mother's name is Margaret"

In [140]:
#Aliased fill values.
phrase = """My name is {me}. My mother's name is {mom}. My sister's name is {sis}. 
{sis}'s mother's name is also {mom}""".format(me='Jason', sis='Jenny', mom='Margaret')
print(phrase)

My name is Jason. My mother's name is Margaret. My sister's name is Jenny. 
Jenny's mother's name is also Margaret
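
Two related tools are worth knowing about. If you are on Python 3.6 or later (an assumption, since these notes do not state a version), "f-strings" let you put variable names directly inside the curly braces. In either style, a format specification after a colon controls how numbers are displayed. The variable names below are illustrative.

```python
me = 'Jason'
mom = 'Margaret'
ratio = 123 / 321

# f-string: note the f before the opening quote (requires Python 3.6+)
sentence = f"My name is {me}. My mother's name is {mom}."
print(sentence)

# Format specifications work with both .format and f-strings
three_decimals = "The ratio is {:.3f}".format(ratio)   # fixed decimal places
as_percent = f"The ratio is {ratio:.1%}"               # shown as a percentage
print(three_decimals)
print(as_percent)
```

The formatting only changes what is displayed; the value stored in `ratio` keeps its full precision.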


## Data Structures
There are three primary data structures that are native to Python:
- Lists
- Sets
- Dictionaries

Data Structures are Python objects that "hold" values. Each structure behaves a little differently and has situations where it is the appropriate choice. There is another data structure that we will deal with in the future (the tuple) that is similar to a list, except that it is not mutable. More on that later.<br>

We will also be encountering new Data Structures when we start using the Pandas library. Until then we will focus just on the ones listed above.

#### Lists
Lists are containers that hold data in a particular order. The data need not be the same type and lists can be as long as you wish them to be. In their most basic form, lists are defined by values separated by commas and enclosed in square brackets.

In [101]:
list1 = [1,2,3]
print(list1)

[1, 2, 3]


In [153]:
# Lists do not need to contain values of the same data type
list2 = [1, 2.0, 'three']
print(list2)

[1, 2.0, 'three']


In [107]:
# You can even store other data structures in lists
list3 = [1, 2.0, 'three', [1,2,3,4], {5,6,7}, {'Jason': 'Teacher', 'Ayo': 'Student', 'Steve': 'Director'}]
print(list3)
print('\nlist3 is {} elements long'.format(len(list3)))

[1, 2.0, 'three', [1, 2, 3, 4], {5, 6, 7}, {'Steve': 'Director', 'Jason': 'Teacher', 'Ayo': 'Student'}]

list3 is 6 elements long


Accessing elements of a list is done by the index value where the data is stored in the list. Indices begin at zero, not one. This can be confusing for many people. Therefore, if a list is n elements long, the last value of the list is actually at index n-1.

In [149]:
list3[0] #return the first value of the list

1

In [154]:
list3[5] #return the sixth value of the list. In this case it is another data structure (a dictionary)

{'Ayo': 'Student', 'Jason': 'Teacher', 'Steve': 'Director'}

In [151]:
list3[-1] #return the last value of the list (i.e. first value going backwards)

{'Ayo': 'Student', 'Jason': 'Teacher', 'Steve': 'Director'}

In [152]:
list3[-3] #return the third-to-last value of the list

[1, 2, 3, 4]

Slicing is when you take multiple consecutive elements of a list. The closing index of the range is actually the *position* where the element ends. So the first item in a list is defined as the item between positions zero and one. Therefore, when slicing you provide the starting position and end positions to return what is in between.

In [146]:
list4 = [1,2,3,4,5,6,7,8,9,10]
list4[0:4] #return the elements between the 0th and 4th positions. This would be the first through fourth elements.

[1, 2, 3, 4]

In [147]:
list4[1:4] #return the elements between the 1st and 4th positions. This would be the second through fourth elements.

[2, 3, 4]

In [148]:
list4[5:8] #return the elements between the 5th and 8th positions. This would be the sixth through eighth elements.

[6, 7, 8]

In [123]:
# Slicing outside of the range of a list will simply return an empty list
list4[15:20]

[]

In [125]:
# However, trying to access an index outside of the range of a list will return an error.
list4[15]

IndexError: list index out of range
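
Slices have a few shortcuts that build on the same "positions" idea. Leaving out the start means "from the beginning," leaving out the end means "to the end," and an optional third number sets the step size. The sketch below reuses the ten-element list; the other variable names are illustrative.

```python
list4 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

first_three = list4[:3]       # start omitted: begin at position 0
last_three = list4[-3:]       # end omitted: run to the end of the list
every_other = list4[::2]      # step of 2: every second element
reversed_copy = list4[::-1]   # step of -1: walk the list backwards

print(first_three)
print(last_three)
print(every_other)
print(reversed_copy)
```

Slicing always returns a new list, so `list4` itself is left unchanged by all of the lines above.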

#### Sets
In some ways, sets can be considered derivatives of lists. They can be created *from* lists but they have the following characteristics that deviate significantly from lists:
- Sets do not have any order. They exist as a collection of values that the computer arranges in whatever way lets it access the data most quickly.
- Sets can only contain what are considered to be "hashable" values. This includes integers, floats, strings, None values, and tuples. You cannot put lists or dictionaries into a set.
- Sets cannot have duplicate values. This includes floats and integers that represent the same number (e.g. 3 and 3.0 are considered the same thing for a set)
- You cannot retrieve specific values from a set. With lists, you can use indexing. There is no such possibility with sets. You can iterate over sets in a loop, which is one way to access data in a set. We will go over iterating in the next class. You can also "pop" a value, but there is no way to choose which value you get back.

Despite these challenges, sets are great for the following purposes:
- Eliminating duplicate values from a list to get only the unique values.
- Using two sets to VERY quickly find unions, intersections, differences, and other attributes to evaluate two sets of data.
- Looking for a specific value within a set of values.
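
Each of those three uses can be sketched in a few lines. The visitor lists below are made-up data for illustration. Because sets have no order, the results are passed through `sorted` before printing so the display is predictable.

```python
visitors_monday = ['ann', 'bob', 'bob', 'cy', 'dee']
visitors_tuesday = ['bob', 'dee', 'eve']

# 1. Remove duplicates by converting the list to a set
unique_monday = set(visitors_monday)
unique_tuesday = set(visitors_tuesday)

# 2. Compare two sets
both_days = unique_monday & unique_tuesday      # intersection
either_day = unique_monday | unique_tuesday     # union
monday_only = unique_monday - unique_tuesday    # difference

# 3. Check whether a specific value is present
ann_came_monday = 'ann' in unique_monday

print(sorted(unique_monday))
print(sorted(both_days))
print(sorted(either_day))
print(sorted(monday_only))
print(ann_came_monday)
```

Note that `sorted` returns a list, which is a handy way to get back to an ordered structure once the set has done its job.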

In [136]:
set1 = {1, 2, 2.0, 3, 'a', 'b'} # Creating a set from scratch
print(set1)

{1, 2, 3, 'b', 'a'}


In [142]:
list5 = [None, (1,2), 3, 3.0]
set2 = set(list5) # Converting a list into a set

In [156]:
list7 = [None, 3, 3.0, [1,2,3], 'a'] # Lists are not "hashable" so they cannot be included as elements within a set
set3 = set(list7)

TypeError: unhashable type: 'list'

In [157]:
# You cannot access any specific value within a set
print(set2)
set2[0]

{(1, 2), 3, None}


TypeError: 'set' object is not subscriptable

In [66]:
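my_new_set = {1, 2, 3, 4, 5, 6} # Assumed definition (not shown in the original notes) that matches the output below
print(my_new_set)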

{1, 2, 3, 4, 5, 6}


In [159]:
# Here I am creating lists of random numbers between 1 and 10,000,000. Each list has 1,000,000 values.
# You do not need to understand how I am doing this. Just know that these lists are pretty long.
import random
#generate long lists
long_list1 = [random.randint(1,10000000) for _ in range(1000000)]
long_list2 = [random.randint(1,10000000) for _ in range(1000000)]

In [171]:
# Using element-wise comparisons of lists can be very slow when dealing with long lists.
# I ran the line of code below for 20 minutes and it still had not finished computing
x = [i for i in long_list1 if i in long_list2]
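
The slow part of that line is the `in long_list2` check, which scans the whole second list once for every element of the first. One possible middle ground (a sketch, not the approach used in these notes) is to keep the list comprehension but convert only the second list to a set for the lookups. This keeps the order and any duplicates from the first list. The example uses much smaller lists, with hypothetical names, so that the slow version finishes quickly.

```python
import random

random.seed(0)  # fixed seed so the example is repeatable

# Smaller stand-ins for long_list1 and long_list2
small_list1 = [random.randint(1, 20000) for _ in range(2000)]
small_list2 = [random.randint(1, 20000) for _ in range(2000)]

# Slow: each "in" check scans small_list2 from the start
slow = [i for i in small_list1 if i in small_list2]

# Fast: each "in" check against a set is a single hashed lookup
lookup = set(small_list2)
fast = [i for i in small_list1 if i in lookup]

print(slow == fast)
print(len(fast))
```

Both versions produce the same list. The difference is only in how long the membership checks take, and that gap grows quickly as the lists get longer.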

In [172]:
# Conversely, sets took a fraction of a second to finish computing.
from datetime import datetime as dt # dt.now() is used below to time the code
start = dt.now()
long_set1 = set(long_list1)
long_set2 = set(long_list2)
intersection = long_set1 & long_set2
end = dt.now()
print("It took sets {} to find all of the common values in long_list1 and long_list2".format(end-start))

It took sets 0:00:00.320799 to find all of the common values in long_list1 and long_list2


In [174]:
# You can also use sets to find unions
union = set(long_list1) | set(long_list2)
print("There were {} values in the intersection and {} values in the union".format(len(intersection), len(union)))

There were 90511 values in the intersection and 1812629 values in the union


#### Dictionaries
Dictionaries are made up of key-value pairs and combine many of the positive aspects of lists and sets. The key can be considered a sort of "address," and the value is the Python object you are trying to retrieve. (In a list, the "address" is the index of the value.) Dictionaries have the following characteristics:
- Keys must be of one of the "hashable" types (Integer, float, string, tuple, None)
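- The keys are the equivalent of a set in that they are a hashed index. You look values up by key, not by position. (Older versions of Python do not keep keys in any particular order; Python 3.7 and later keep them in the order they were inserted.)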
- "Values" can be of any data type (similar to a list)

Dictionaries are great for the following purposes:
- Efficiently storing values in an easy-to-find place
- Being able to name the location of where your values are located
- Passing around many different values in a single object

In [177]:
# Dictionaries are defined by curly braces, similar to sets
# Key-value pairs are separated by commas, similar to sets
# Keys and values are connected by a colon. There are no requirements for having (or not having) spaces around the colon
my_dict1 = {'Jason': 'Teacher', 'Ayo': 'Student'}
print(my_dict1)

{'Jason': 'Teacher', 'Ayo': 'Student'}


In [178]:
# Access the data associated with 'Jason'
my_dict1['Jason']

'Teacher'

In [180]:
# Let's create a more useful dictionary
useful_dict = {'Jason': {'Age': 39, 'Job': 'SVP Data Science', 'Pets': ['Copland', 'Gershwin']},
               'Barack': {'Age': 57, 'Job': 'Retired', 'Pets': ['Bo']}, 
               'Jenny': {'Age': 44, 'Job': 'Finance Manager', 'Pets': None}}
print(useful_dict)

{'Barack': {'Pets': ['Bo'], 'Job': 'Retired', 'Age': 57}, 'Jenny': {'Pets': None, 'Job': 'Finance Manager', 'Age': 44}, 'Jason': {'Pets': ['Copland', 'Gershwin'], 'Job': 'SVP Data Science', 'Age': 39}}


In [181]:
# Let's get my details
useful_dict['Jason']

{'Age': 39, 'Job': 'SVP Data Science', 'Pets': ['Copland', 'Gershwin']}

In [182]:
# Let's get my age
useful_dict['Jason']['Age']

39

In [185]:
# Let's get my first pet
useful_dict['Jason']['Pets'][0]

'Copland'

In [183]:
# Let's get my age but forget to capitalize 'Age'
useful_dict['Jason']['age']

KeyError: 'age'
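
If you are not sure whether a key exists, there are two standard ways to avoid the `KeyError`: check with `in` first, or use the dictionary's `get` method, which returns `None` (or a fallback value you choose) when the key is missing. A minimal sketch using a copy of one person's details; the variable names are illustrative.

```python
person = {'Age': 39, 'Job': 'SVP Data Science', 'Pets': ['Copland', 'Gershwin']}

has_lowercase_key = 'age' in person          # keys are case sensitive
missing = person.get('age')                  # no error, just None
with_fallback = person.get('age', 'unknown') # or a default of your choosing
found = person.get('Age')                    # existing keys work as usual

print(has_lowercase_key)
print(missing)
print(with_fallback)
print(found)
```

Square brackets are still the better choice when a missing key really is a mistake, because the error tells you immediately that something is wrong.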

In [184]:
# Let's get all of the keys for the useful dictionary
useful_dict.keys() #Note that the order of the names is different than how I entered them

dict_keys(['Barack', 'Jenny', 'Jason'])

In [187]:
# Let's add a new person and a new value. There is no requirement for the values to look like the values of other keys.
useful_dict['Margaret'] = "The best mother I've ever had"
print(useful_dict)

{'Barack': {'Pets': ['Bo'], 'Job': 'Retired', 'Age': 57}, 'Margaret': "The best mother I've ever had", 'Jenny': {'Pets': None, 'Job': 'Finance Manager', 'Age': 44}, 'Jason': {'Pets': ['Copland', 'Gershwin'], 'Job': 'SVP Data Science', 'Age': 39}}


# Applied Problem
Let's imagine we are trying to wrap a rope around a cylinder. The rope has a length of 100m and the radius of the cylinder is 5m.

In [188]:
rope_length = 100
pi = 3.141592653589
radius = 5

In [190]:
# Let's see how many times the rope will fully wrap around the cylinder
circumference = 2*radius*pi
times_around = rope_length // circumference #Use integer division
print(times_around)

3.0


In [191]:
# Let's see how much rope will be left over after fully looping around the cylinder as much as possible.
remaining = rope_length % circumference #Here is one use case of the modulo
print(remaining)

5.7522203923299955


In [193]:
# How much rope did I use wrapping around the cylinder?
rope_used = circumference * (rope_length // circumference)
print(rope_used)

94.24777960767
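
The same problem can be written a little more compactly. The sketch below is an alternative, not a correction: it takes pi from the standard `math` module instead of typing the digits by hand, and it uses `divmod` to get the number of full wraps and the leftover rope in one step. Because `math.pi` carries a few more digits, the last decimal places may differ slightly from the output above.

```python
import math

rope_length = 100   # meters
radius = 5          # meters

circumference = 2 * math.pi * radius

# divmod returns (result of //, result of %) as a pair
full_wraps, remaining = divmod(rope_length, circumference)
rope_used = rope_length - remaining

print(int(full_wraps))   # convert 3.0 to 3 for display
print(remaining)
print(rope_used)
```

Since the circumference is a float, `divmod` returns floats as well, which is why the wrap count is converted with `int` before printing.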
