## Getting Started with Python

[**Neal Caren**](mailto:neal.caren@gmail.com)  
University of North Carolina, Chapel Hill

This section introduces some of the most relevant aspects of working with Python for social scientists. This includes the different data types available and ways to modify them. 

In 2010, as part of his first State of the Union address, President Barack Obama said: 

> Let us invest in our people without leaving them a mountain of debt

To store that in Python, create a new variable called `sentence`:

In [33]:
sentence =  'Let us invest in our people without leaving them a mountain of debt.'

The text is surrounded by a single quote (`'`) on each side. 
To make sure that you typed the sentence correctly, you can type `sentence`:

In [44]:
sentence

'Let us invest in our people without leaving them a mountain of debt.'

You can get almost the same response using the `print` function:



In [45]:
print(sentence)

Let us invest in our people without leaving them a mountain of debt.


The only difference is that the first response was wrapped in single quotes and the second wasn’t. As a side note, the single quotes aren’t there because you typed them. If you had used double quotes, Python would still display single quotes.

In [46]:
sentence =  "Let us invest in our people without leaving them a mountain of debt."

sentence

'Let us invest in our people without leaving them a mountain of debt.'

In addition to `'` and `"`, strings can also be marked with triple quotes (`'''`). This last option is particularly useful when your text contains contractions or quotation marks. 

In [47]:
new_sentence = '''Let's invest in our people without leaving them a mountain of debt.'''

In [48]:
print(new_sentence)

Let's invest in our people without leaving them a mountain of debt.
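Triple quotes also allow a string to span more than one line. The short sketch below is only an illustration; the variable `note` and its text are made up for this example.

```python
# `note` is a hypothetical variable used only for illustration.
note = '''She said, "Let's invest in our people."
That was the whole quote.'''

print(note)

# The line break typed inside the triple quotes is stored as a newline character.
print('\n' in note)
```

Both the apostrophe and the double quotation marks sit inside the string without ending it early.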


### Your turn
Create a new string called <code>food</code>  that is a sentence about your most recent meal. Display the contents of your new string. 


In [49]:
food = 'My standard lunch is a veggie burrito.'
print(food)

My standard lunch is a veggie burrito.


## Strings

Python has a few tools for manipulating text, such as `lower` for making the string lower-case.

In [12]:
sentence.lower()

'let us invest in our people without leaving them a mountain of debt.'

This did not alter the original string, however.

In [13]:
sentence

'Let us invest in our people without leaving them a mountain of debt.'

In Python, strings are immutable, meaning that once created, they cannot be altered in place. We could store the results in a new variable.

In [14]:
lower_sentence = sentence.lower()

lower_sentence

'let us invest in our people without leaving them a mountain of debt.'
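To see what immutability means in practice, here is a small sketch that tries to overwrite the first character of `sentence`. It uses the bracket notation covered in the Slicing section, and it catches the error so that the example runs to the end.

```python
sentence = 'Let us invest in our people without leaving them a mountain of debt.'

# Trying to change one character in place fails because strings are immutable.
try:
    sentence[0] = 'l'
except TypeError as error:
    print('TypeError:', error)

# The original string is unchanged.
print(sentence)
```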

### Your turn
Create a new, lowercase version of your <code>food</code> string.

In [None]:

lower_food = food.lower()

print(lower_food)


We can also `replace` words within the string.

In [15]:
sentence.replace("people", "citizens")

'Let us invest in our citizens without leaving them a mountain of debt.'

`replace` can also be used to remove text by putting nothing between the replacement quotation marks.

In [17]:
sentence.replace(".", "")

'Let us invest in our people without leaving them a mountain of debt'

As before, this does not alter the original string. If you wanted to save the string edits, you would need to create a new variable.






In [18]:
edited_sentence = sentence.lower()
print(edited_sentence)

let us invest in our people without leaving them a mountain of debt.


If you were doing a series of manipulations, you could reuse a variable name, although it is best practice to keep a version of the original string in case you ever need to go back to it. 

In [19]:
edited_sentence = sentence.lower()
print(edited_sentence)

edited_sentence = edited_sentence.replace(".", "")
print(edited_sentence)

let us invest in our people without leaving them a mountain of debt.
let us invest in our people without leaving them a mountain of debt


You can also stack multiple transformations together, although combining too many may make your code harder to follow.

In [20]:
sentence.replace(".", "").lower()

'let us invest in our people without leaving them a mountain of debt'

### Your turn
Create a new string called <code>boring</code> that removes the exclamation marks and capitalization from the sentence "Way to go!!!".  


In [None]:
boring = "Way to go!!!".lower().replace('!', '')

print(boring)

## Slicing

If you had a very long text, such as the entire text of the State of the Union, you might only want to look at the first few characters. In Python, this is called slicing.

In [34]:
sentence

'Let us invest in our people without leaving them a mountain of debt.'

In [22]:
sentence[0:20]

'Let us invest in our'

A slice is signaled with brackets (`[]`). The first number is the starting position, where 0 indicates the beginning. This is followed by a colon (`:`) and then the end position, which, in this case, is 20. The character at the end position is not included in the slice. Note that this is splitting on characters, not words.

Here is a section from the middle of the string:

In [23]:
sentence[20:32]

' people with'

For convenience, if you omit the number before the colon, it defaults to the beginning of the string.

In [24]:
sentence[:40]

'Let us invest in our people without leav'

Omitting the second number defaults to the end.

In [25]:
sentence[40:]

'ing them a mountain of debt.'

Finally, negative numbers are interpreted as distance from the end of the string.

In [26]:
sentence[-20:]

' a mountain of debt.'
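Because the start position is included and the end position is excluded, neighboring slices never overlap. A brief sketch using the same `sentence`:

```python
sentence = 'Let us invest in our people without leaving them a mountain of debt.'

# These two slices share position 20 as a boundary: the first stops just
# before it and the second starts on it, so no character appears twice.
print(sentence[:20])
print(sentence[20:])

# Negative numbers work for the end position too: this drops the final period.
print(sentence[:-1])

# Both positions can be negative: the last word without the period.
print(sentence[-5:-1])
```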

### Your turn
Create a new string called `s` that contains `The weather is hot and humid today.` Find the slices for each of the following:
* "The w"
* "today."
* "hot and humid"


In [31]:
s = 'The weather is hot and humid today.'

print(s[:5])
print(s[-6:])
print(s[15:28])

The w
today.
hot and humid


## Numbers

We can also count the number of characters in a string with the `len` function.

In [53]:
len(sentence)

68

In this case, Python returned an integer instead of a string. This can also be stored in a variable.

In [54]:
sentence_length = len(sentence)

In [55]:
sentence_length

68

### Your turn
What is the length of <code>How many dogs do you own?</code>? Store it in a variable called <code>sl</code>.


In [None]:
question = 'How many dogs do you own?'
sl = len(question)
print(sl)

Since the length of a string is a number, we can do standard math operations with it.

In [56]:
print(sentence_length * 3)

204


In [57]:
print(sentence_length / 2)

34.0


In [58]:
print(sentence_length + sentence_length)

136


### Your turn
What is one-third the length of <code>sl</code>?

In [None]:
sl/3

As with strings, these can be saved in new variables.

In [60]:
double_length = sentence_length + sentence_length

print(double_length)

136


Two of these operators, `*` and `+`, also work with strings.

In [61]:
print(sentence * 2)

Let us invest in our people without leaving them a mountain of debt.Let us invest in our people without leaving them a mountain of debt.


In [35]:
print(sentence + sentence)

Let us invest in our people without leaving them a mountain of debt.Let us invest in our people without leaving them a mountain of debt.


The `+` operator can't be used to combine a string with a number, however.

In [63]:
print("The sentence was " + sentence_length + "characters.")

TypeError: can only concatenate str (not "int") to str

Conveniently, the `str` function will convert an integer to a string.

In [65]:
print("The sentence was " + str(sentence_length) + " characters.")

The sentence was 68 characters.


I had to manually include the spaces before and after `sentence_length`. Otherwise, it is all smushed together. 

In [66]:
print("The sentence was" + str(sentence_length) + "characters.")

The sentence was68characters.
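An alternative that avoids both the `str` call and the manual spacing is a formatted string, or f-string, available in Python 3.6 and later. This is a brief sketch rather than something used elsewhere in this section:

```python
sentence = 'Let us invest in our people without leaving them a mountain of debt.'
sentence_length = len(sentence)

# Prefix the string with f and put the variable name inside curly brackets.
# Python converts the integer to text and inserts it at that spot.
message = f"The sentence was {sentence_length} characters."
print(message)
```

The spaces are simply typed as part of the surrounding text, so there is less to get wrong.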


### Your turn

Print `The length of the word "hippopotamus" is [x].` where `[x]` is the length of the word hippopotamus.


In [None]:
l = len('hippopotamus')

print('The length of the word "hippopotamus" is ' + str(l) + '.')

## Lists

You can also `split` the sentence into a series of strings. By default, this splits based on spaces and other whitespace characters such as a line break (`\n`) or tab character (`\t`). 

In [69]:
print(sentence.split())

['Let', 'us', 'invest', 'in', 'our', 'people', 'without', 'leaving', 'them', 'a', 'mountain', 'of', 'debt.']


What is returned here is a third data type (the first two were strings and integers) called a list. A list is enclosed in brackets (`[]`) and the items are separated by commas. In this case, each item is in quotation marks because they are all strings. Items in a list, however, can be of any type.

In [70]:
my_list = ['Speeches', 7, 'Data']
my_list

['Speeches', 7, 'Data']

While `len` returned the number of characters in a string, it returns the number of items in a list.

In [71]:
len(my_list)

3

In [72]:
sentence_length = len(sentence.split())
sentence_length

13

In the second example, the list created by `sentence.split()` is not saved in any way; only its length is.
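`split` can also break a string on something other than whitespace if you pass it the text to split on. As a sketch, the made-up string `csv_line` below is split on commas:

```python
# `csv_line` is a hypothetical string used only for illustration.
csv_line = 'female,1,College'

# Passing ',' tells split to break the string at each comma.
fields = csv_line.split(',')
print(fields)

# Every item in the resulting list is a string, including the '1'.
print(len(fields))
```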

### Your turn
Create a list called **ate** that includes at least three things you ate today. Use `len` to count the number of items in the list.




In [None]:
ate = ['apple', 'dosa', 'pizza slice']

print(len(ate))

Like strings, lists can also be sliced. The first three items of a list:

In [75]:
words = sentence.split()
print(words[:3])

['Let', 'us', 'invest']


We can also extract specific items from a list by their position. As with strings, positions in a list start at 0.

In [76]:
words[0]

'Let'

The third word:

In [78]:
words[2]

'invest'

The fifth word from the end:

In [79]:
words[-5]

'them'

The last two words:

In [80]:
words[-2:]

['of', 'debt.']

### Your turn
Display the first two items of your **ate** list. What is the last item?


In [83]:
print(ate[:2])
print(ate[-1])

['apple', 'dosa']
pizza slice


Slicing a list returns a list. If you ask for the first three items, you will get a list made up of those items. In contrast, if you request a specific location, such as `words[2]`, Python returns the specific object stored in that place, which may be a string, a number, or even an entire list. 

Unlike strings, lists are mutable. That means that we can remove things from a list or, as is more frequently the case in text analysis, add things to it. Adding is done with `append`.

In [1]:
male_words = ['his', 'him', 'father']
male_words.append('brother')
print(male_words)

['his', 'him', 'father', 'brother']


Since `append` is changing `male_words`, we do not want to use an `=`. The Python interpreter is editing our original list but not returning anything.

In [2]:
not_going_to_work = male_words.append('brother')
print(not_going_to_work)

None


Lists can also be combined using `+`.

In [3]:
gendered_words = male_words + ['her', 'she', 'mother']
print(gendered_words)

['his', 'him', 'father', 'brother', 'brother', 'her', 'she', 'mother']


As noted above, the items in a list can include a variety of data types. This includes lists.

In [4]:
gendered_lists = [ male_words ,  ['her', 'she', 'mother'] ]

Note the two closing brackets next to each other. The first closes the list that ends with 'mother' while the second closes our `gendered_lists`.

In [5]:
len(gendered_lists)

2

`gendered_lists` has a length of two because it contains just two items, each a list of varying lengths.

In [6]:
print(gendered_lists)

[['his', 'him', 'father', 'brother', 'brother'], ['her', 'she', 'mother']]
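To reach an item inside one of the inner lists, you can use two sets of brackets in a row. A minimal sketch, rebuilding the same lists so that it runs on its own:

```python
male_words = ['his', 'him', 'father', 'brother', 'brother']
gendered_lists = [male_words, ['her', 'she', 'mother']]

# The first pair of brackets picks an inner list.
print(gendered_lists[1])

# A second pair of brackets picks an item within that inner list.
print(gendered_lists[1][0])

# Negative positions work here as well: the last item of the first list.
print(gendered_lists[0][-1])
```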


### Your turn
Add three more items to your `ate` list. Use `append` for one. 
For the other two, place them in a list and then combine the two lists. 

In [None]:
# solution
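One possible solution is sketched below. It assumes the exercise refers to the `ate` list created earlier, and the food names and the helper variable `more_food` are made up for illustration.

```python
ate = ['apple', 'dosa', 'pizza slice']

# Add a single item in place with append (no = needed).
ate.append('banana')

# Put the other two items in their own list, then combine the two lists.
more_food = ['yogurt', 'granola']  # hypothetical items
ate = ate + more_food

print(ate)
print(len(ate))
```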

## Dictionaries

A fourth useful data type is a dictionary. A dictionary is like a list in that it holds multiple items. The items in a list can be identified by their position in the list. In contrast, the values in a dictionary are associated with a keyword. The analogy here is to a physical dictionary, which has a list of unique words, and each word has a definition. In this case, the entries are called keys, and the definitions, which can be any data type, are called values. 

Alternatively, you can think of a dictionary as a single row of data from a dataset, where the keys are the variable names.

In [7]:
respondent = {'sex'   : "female",
              'abany' : 1,
              'educ'  : 'College'}

Dictionaries are surrounded by curly brackets (`{}`). Each entry is a pair consisting of the key, which is usually a string (although other immutable values, such as numbers, also work), followed by a colon and then the value. As in a list, entries are separated by commas.



You can access the contents of a dictionary by enclosing the key in brackets (`[]`).

In [9]:
respondent['sex']

'female'

If the key is not in the dictionary, you will get a `KeyError`.

In [10]:
respondent['gender']

KeyError: 'gender'
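If you are not sure whether a key exists, there are two common ways to avoid the `KeyError`. The sketch below rebuilds the `respondent` dictionary so that it runs on its own; the fallback text `'not asked'` is just an illustrative choice.

```python
respondent = {'sex': 'female', 'abany': 1, 'educ': 'College'}

# `in` checks whether a key exists before you look it up.
print('gender' in respondent)
print('sex' in respondent)

# `get` returns None for a missing key instead of raising a KeyError.
print(respondent.get('gender'))

# A second argument to `get` supplies a fallback value of your choosing.
print(respondent.get('gender', 'not asked'))
```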

You can inspect all the keys in a dictionary, in case you forgot or someone else made it.

In [11]:
respondent.keys()

dict_keys(['sex', 'abany', 'educ'])

In [12]:
len(respondent.keys())

3

Dictionaries are mutable, so we can change the value of existing keys, remove keys, or add new ones.

In [13]:
respondent['race'] = 'Black'

print(respondent)

{'sex': 'female', 'abany': 1, 'educ': 'College', 'race': 'Black'}


In [14]:
respondent['abany'] = 'Yes'

print(respondent)

{'sex': 'female', 'abany': 'Yes', 'educ': 'College', 'race': 'Black'}
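Removing a key is not shown above, so here is a short sketch of two ways to do it. It works on a copy, which I have called `trimmed`, so that the `respondent` dictionary used in the rest of this section is left alone.

```python
respondent = {'sex': 'female', 'abany': 'Yes', 'educ': 'College', 'race': 'Black'}

# Work on a copy so the original dictionary is not changed.
trimmed = respondent.copy()

# del removes a key and its value in place.
del trimmed['race']
print(trimmed)

# pop also removes the key, but hands back the value so you can keep it.
education = trimmed.pop('educ')
print(education)
print(trimmed)

# The original still has all four keys.
print(len(respondent))
```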


### Your turn
Add a new key to the dictionary called `age` with a value of 37. Confirm that you did it correctly by displaying the value of `age`.



In [None]:
respondent['age'] = 37

print(respondent['age'])

As noted above, while the keys are usually strings, the values can be any data type. You could add the ages of the respondent's children as a list.

In [17]:
respondent['children ages'] = [3, 5, 10]

print(respondent)

{'sex': 'female', 'abany': 'Yes', 'educ': 'College', 'race': 'Black', 'age': 37, 'children ages': [3, 5, 10]}


## Spaces


Within the Python community, there are strong norms about how code should be written. Many of these center on making code readable, both by others and by your future self. As a trivial example, `2+2` is allowed, but is almost always written `2 + 2`. Likewise, I defined my respondent dictionary with plenty of white space in order to maximize readability.

In [18]:
respondent = {'sex'   : "female",
              'abany' : 1,
              'educ'  : 'College'}

This is identical to:

In [19]:
respondent={'sex':'female','abany':1,'educ':'College'}




but putting it all on one line obscures the logic of the dictionary. In this case, what is a key and what is a value is quite clear in the first version, while distinguishing between the two is more problematic in the single-line version. 



In [20]:
r2 = {'sex':'male',   'abany':1, 'educ':'College'     }
r3 = {'sex':'female', 'abany':0, 'educ':'High School' }
r4 = {'sex':'male',   'abany':0, 'educ':'Some College'}

In [21]:
respondents = [respondent, r2, r3, r4]

In [22]:
respondents

[{'sex': 'female', 'abany': 1, 'educ': 'College'},
 {'sex': 'male', 'abany': 1, 'educ': 'College'},
 {'sex': 'female', 'abany': 0, 'educ': 'High School'},
 {'sex': 'male', 'abany': 0, 'educ': 'Some College'}]

This now looks a lot like the common data format JSON!
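The resemblance is close enough that Python's built-in `json` module can convert between the two. A minimal sketch, using a shortened two-person version of the list so that it runs on its own:

```python
import json

respondents = [
    {'sex': 'female', 'abany': 1, 'educ': 'College'},
    {'sex': 'male', 'abany': 1, 'educ': 'College'},
]

# dumps turns Python lists and dictionaries into a JSON-formatted string.
json_text = json.dumps(respondents)
print(json_text)

# loads goes the other way, from a JSON string back to Python objects.
restored = json.loads(json_text)
print(restored == respondents)
```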

## Loops

In [23]:
for person in respondents:
    print(person['educ'])

College
College
High School
Some College


In [24]:
for item in [1,2,'bobcat']:
    print(item)

1
2
bobcat
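A `for` loop runs the indented lines once for each item in the list, with the loop variable (`person` or `item` above) holding the current item. A common pattern is to build up a result as the loop runs. The sketch below rebuilds `respondents` so that it runs on its own; the names `educ_levels` and `total` are my own.

```python
respondents = [
    {'sex': 'female', 'abany': 1, 'educ': 'College'},
    {'sex': 'male', 'abany': 1, 'educ': 'College'},
    {'sex': 'female', 'abany': 0, 'educ': 'High School'},
    {'sex': 'male', 'abany': 0, 'educ': 'Some College'},
]

# Start with an empty list and append one value per respondent.
educ_levels = []
for person in respondents:
    educ_levels.append(person['educ'])
print(educ_levels)

# Start at zero and add each respondent's abany value to a running total.
total = 0
for person in respondents:
    total = total + person['abany']
print(total)
```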


### Your turn

Loop over the items in your `ate` list. For each item, print its length.



In [None]:
# Answer
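One possible answer, assuming the exercise refers to the three-item `ate` list from the earlier exercise:

```python
ate = ['apple', 'dosa', 'pizza slice']

# For each food, len counts the characters in the string, spaces included.
for item in ate:
    print(len(item))
```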



## Functions

For those who come from a Stata or R background, one of the more striking aspects of Python code is the frequency of user-defined functions. They are deployed not just for things where you think there should be a function, like counting words in a sentence, but also for highly custom situations, such as scraping the contents of a particular web page. This style of programming, with many small functions, tends to make code more readable and easier to debug than code written in a more traditional social science style.

A standard function has three parts. First, the function is named and defined. The subsequent line or lines do the actual work. Finally, the results are returned.  

A trivial function that returns the string `Hello!` might look like:

In [25]:
def make_hello():
    word = 'Hello!'
    return word

Of note, `def` signals that you are defining a function. This is followed by the name of the function, in this case `make_hello`. Since this function doesn't take any arguments, such as a variable to modify or any options, the name is followed by an empty `()`. The first line ends with a colon.

All subsequent lines are indented. The second line creates a new string variable called `word` which contains `Hello!`. The third and final line of the function returns the value stored in `word`. 

In [26]:
make_hello()

'Hello!'

More commonly in text analysis, a user-defined function modifies an existing string. In this case, the variable name that will be used within the function is established within the parentheses on the opening line. 

A second trivial function takes a text string and returns an all-caps version. 

In [27]:
def scream(text):
    text_upper = text.upper()
    return text_upper

In [28]:
scream('Hi there!')

'HI THERE!'

The `text` and `text_upper` variables only exist within the function. That means that you can pass a variable not called `text` to the function.

In [36]:
scream(sentence)

'LET US INVEST IN OUR PEOPLE WITHOUT LEAVING THEM A MOUNTAIN OF DEBT.'

It also means that everything but the returned value disappears once the function finishes.

In [37]:
text_upper

NameError: name 'text_upper' is not defined

It is a good idea to include a docstring, a string on the first line of the function body that explains what the function does. This is helpful for other people reading your code and for you when you return to your own code days or months later.

In [38]:
def scream(text):
    '''Returns an all-caps version of text string.'''
    text_upper = text.upper()
    return text_upper
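The explanatory string is more than a comment: Python stores it with the function, so it can be looked up later. A small sketch using the same `scream` function:

```python
def scream(text):
    '''Returns an all-caps version of text string.'''
    text_upper = text.upper()
    return text_upper

# The explanation is stored on the function under the name __doc__.
print(scream.__doc__)

# The function itself works exactly as before.
print(scream('Hi there!'))
```

In an interactive session, `help(scream)` displays the same text along with the function's name and arguments.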

### Your turn

Make a function called `whisper` that replaces all exclamation marks with a period and returns a lower case version of a string. Test it out.

In [39]:
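def whisper(text):
    '''Returns a lower-case version of text with exclamation marks replaced by periods.'''
    quiet_text = text.replace('!', '.').lower()
    return quiet_text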