# Python data types

Python has different data types to hold different kinds of information. A few:
* Integers (type int) hold whole numbers. No decimals!
* Floats (type float) hold numbers with a decimal point.
* Strings (type str) hold text. A string can contain digits, but they are treated as text, not as a number.
* Booleans (type bool) hold either the value True or the value False.

Python comes with a built-in function to obtain the data type of an object: type()

Additionally, Python comes with functions to convert objects from one data type to another, and these conversions follow set rules. For instance, bool(1) returns True and bool(0) returns False.
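As a small illustration of these conversion rules, the sketch below applies the built-in int(), float(), str() and bool() to a few values. The variable names are made up for this example only.

```python
# Converting between the basic data types with built-in functions
as_int = int('10')          # the string '10' becomes the integer 10
as_float = float('2.5')     # the string '2.5' becomes the float 2.5
as_str = str(10)            # the integer 10 becomes the string '10'
truncated = int(2.9)        # int() drops the decimal part, giving 2
empty_is_false = bool('')   # an empty string converts to False
text_is_true = bool('0')    # any non-empty string converts to True, even '0'

print(as_int, as_float, as_str, truncated, empty_is_false, text_is_true)
```

Note that bool('0') is True because the rule for strings is about being empty or not, not about the number the text happens to spell.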

In [None]:
#lets look at type()
#type() 

In [None]:
variable = '10'
#note use of '

In [None]:
type(variable)

In [None]:
type('123')
#note use of '

In [None]:
type(1)
#only a number

In [None]:
type(1.)
#decimal point

In [None]:
type(True) 
#note capital letter of True

## More data types
The list
* holds an arbitrary number of objects, in a set order
* denoted list() or, more commonly, []

The dictionary
* holds an arbitrary number of key-value pairs. Since Python 3.7 a dict keeps its items in insertion order, but items are looked up by key, not by position
* each "item" has a key and a value. The key is used to select the value of that "item"
* denoted dict() or {}

In [1]:
#this is a Python list
[1,2,3]

[1, 2, 3]

In [2]:
#this also:
[]

[]

In [3]:
type([])

list

In [4]:
type([1,2,2])

list

In [5]:
#this is a Python dict, with keys 'A' and 'B'. A holds the integer 2 and B holds 3
D = {'A':2,'B':3}

In [6]:
#lets access 'B's value.
D['B']

3

In [7]:
type({})

dict

# Defining a function
To build anything from simple to complex routines on top of Python's built-in functionality, it is very beneficial to write your own functions. Functions lay the groundwork for lean and efficient code.

In [10]:
#this is how the function defining syntax works
def funct(x): #starts the definition of a function named funct, which takes x as input
    """Write something informative about what input the function takes,
    and what results it yields"""
    #it is good practice to explain inside the triple quotes what the function takes as input and what it returns. However, the docstring is not required for the function to work

    return x/2 #the function returns the input x divided by two

In [12]:
funct(2)


1.0

In [13]:
funct(40)

20.0
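The text inside the triple quotes (the docstring) is more than a comment: Python stores it on the function, so it can be read back later. A minimal sketch, where the docstring wording is only an assumption for this example:

```python
def funct(x):
    """Take a number x and return x divided by two."""
    return x / 2

# The docstring is stored in the function's __doc__ attribute
print(funct.__doc__)
```

Calling help(funct) in a notebook shows the same text together with the function signature.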

In [14]:
# written by Huamin, Oct 10.
# x: a list of strings, such as ['  asaasef  as df   sdf  ds f asd f ', 'asdfsd sdf  s df d  ']
# the function will remove any white space in front as well as at the end of the sentence for each sentence in the list
# new_x: ['asaasef  as df   sdf  ds f asd f', 'asdfsd sdf  s df d']
# [''asaasef', 'as','df','sdf'.....]
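def remove_whitespace(x):
    new_x = [sentence.strip() for sentence in x] #strip() removes whitespace at the start and end of each string
    return new_x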



Notice the 1.0 returned by funct(2) above, with a decimal point in the result? That is because Python's / operator always produces a float, even when the answer is a whole number.
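When a whole-number result is wanted, Python also offers the floor division operator //. The short sketch below compares the two operators; it is an illustration, not part of the exercises.

```python
# True division always returns a float
print(4 / 2)     # 2.0

# Floor division rounds down and keeps ints as ints
print(4 // 2)    # 2
print(7 // 2)    # 3

# If one operand is a float, floor division still returns a float
print(7.0 // 2)  # 3.0
```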

In [None]:
#see how the function we made and a built-in Python function can be used together
int(funct(2))

## Ex 1
Now that we have shown a few built-in functions, and also how to write a custom function, let's practice with some examples. Remember that using functions is good practice and makes for lean and efficient code. You will benefit from mastering the skill of "thinking in terms of functions" when we get to NLP implementations later on.

### Create a function that checks the data type of the input 

In [None]:
#yes, this exercise is "unnecessary", as type() is already an available built-in function. However, the point is to practice putting an existing Python behavior inside a user-defined function. Then we can build more complex functions later on.

In [15]:
def typeCheck(x):
    y = type(x)
    return y

In [16]:
typeCheck(False)

bool

In [17]:
#in this simple case, our function is equivalent to the built-in type()
type(False)

bool

# Introducing more Python behaviors
Now, let's introduce some more behaviors we can use in conjunction with or within our user-defined functions.
# Loops
Loops are a way of repeating the same action on different input, one item at a time.

In [18]:
nums = [1,2,3,4] #we have this list and want to print its items one by one:

for item in nums:
    print(item)

1
2
3
4


# the operator % ('modulo')
All of you are probably familiar with the four basic arithmetic operators + - * /
Imagine an operator that returns the remainder of a division rather than the quotient. That operator is modulo, denoted % in Python. It can be useful in some circumstances:

In [19]:

3%2 #three modulo two yields one: in whole-number division, 3 divided by 2 is 1, with 1 as the remainder.

1
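The relationship between whole-number division and the remainder can be sketched as follows. The numbers and variable names are chosen only for illustration.

```python
dividend, divisor = 7, 3

quotient = dividend // divisor    # whole-number part of the division: 2
remainder = dividend % divisor    # what is left over: 1

# Together, the two results reconstruct the original number
print(quotient * divisor + remainder == dividend)  # True

# The built-in divmod() returns both values at once
print(divmod(dividend, divisor))  # (2, 1)
```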

In [20]:
#because of this behavior, every even number modulo two should yield zero. Hence, we can use modulo to identify the even numbers in a list, using a loop
for item in nums:
    if item%2 == 0:
        print(item)

2
4


In [21]:
#let's use a loop to obtain a list of data types, corresponding to the data types of the items in our input list
newList = []
for item in nums:
    newList.append(type(item))
print(newList)

[<class 'int'>, <class 'int'>, <class 'int'>, <class 'int'>]


In [22]:
#a loop can also be run through a range of numbers, as opposed to through a predefined list
for i in range(10):
    print(i)

0
1
2
3
4
5
6
7
8
9


## Ex 1b Use a loop to print the first 20 multiples of two

In [23]:
for i in range(1,21):
    print(i*2)

2
4
6
8
10
12
14
16
18
20
22
24
26
28
30
32
34
36
38
40


### If
Now let's see how we can print only the multiples of two that are also multiples of three, based on the previous cell.

We will do it by adding an if statement.

An if statement evaluates a condition and runs the code indented under it only if the condition is met.


In [None]:
for i in range(1,21):
    if i%3==0: #check whether i is a multiple of three; if so, i*2 is a multiple of both two and three
        print(i*2)
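A variant of the same idea, sketched here with hypothetical list names kept and skipped: instead of printing, the if statement decides which list each multiple of two goes into, and an else block handles the values that fail the condition.

```python
kept = []      # multiples of two that are also multiples of three
skipped = []   # all other multiples of two

for i in range(1, 21):
    value = i * 2
    if value % 3 == 0:   # condition met: value is divisible by three
        kept.append(value)
    else:                # condition not met
        skipped.append(value)

print(kept)          # [6, 12, 18, 24, 30, 36]
print(len(skipped))  # 14
```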

# Combining some of what we've seen so far:

## Ex 2
Create a __function__ that checks the data type of the input and, in case the input is a list, returns a list of the data types of the items in that list

In [24]:
def typeCheck(x):
    if type(x) == list: #if it is a list, the following will apply
        b = []
        #use a loop to fill the result list b
        for item in x:
            b.append(type(item))

        return b
    else: # if not a list, then the else-block will apply
        return type(x)

In [25]:
testList = ['a',1.0,1,False,[],{}]

In [26]:
typeCheck(testList)

[str, float, int, bool, list, dict]

In [27]:
typeCheck(1.0)

float

# String operations
String operations are important in NLP. They are foundational for the libraries and functions we will use when implementing an NLP pipeline.

In [28]:
'string'.capitalize()

'String'

In [29]:
'string'.upper()

'STRING'

In [30]:
'stRING'.lower()

'string'

In [31]:
'  string '.strip()

'string'

In [32]:
's'.isalnum()

True

In [33]:
'.'.isalnum()

False

In [34]:
'two strings'.split()

['two', 'strings']

In [35]:
'comma,separated,strings'.split(',')

['comma', 'separated', 'strings']

In [36]:
example = 'abc'
for letter in example:
    print(letter)

a
b
c


## Ex 3
Create a function that takes a word which may have spaces around it and may be in different capitalizations, and returns the lower case of the word with spaces removed

In [37]:
word = ' INPutword..   ' 

def cleaner(w):
    w = w.strip()
    w = w.lower()
    return w

cleaner(word)

'inputword..'

## joining a list, creating a string

In [38]:
list('abc')

['a', 'b', 'c']

In [39]:
('').join(['a','b','c'])

'abc'
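split() and join() work well as a pair. As a minimal sketch (the example string is made up), splitting on whitespace and joining with a single space cleans up irregular spacing:

```python
messy = '  two   words  with   extra spaces '

tokens = messy.split()          # with no argument, split() splits on any run of whitespace
normalized = ' '.join(tokens)   # join the tokens back together with single spaces

print(tokens)      # ['two', 'words', 'with', 'extra', 'spaces']
print(normalized)  # two words with extra spaces
```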

## Ex 4
Create a function that takes a word which may have spaces around it, various capitalizations and punctuation and returns the lower case of the word with spaces and punctuation removed

In [41]:
word = ' INPutword..   ' 

def cleaner(w):
    w = w.strip()
    w = w.lower()
    w = list(w)
    w = [letter for letter in w if letter.isalnum()]
    w = ('').join(w)
    
    return w

cleaner(word)

'inputword'

# More functionality and behavior related to lists

In [42]:
letters = list('abc')
letters

['a', 'b', 'c']

In [43]:
nums = [1,2,3]
nums

[1, 2, 3]

In [44]:
len(nums)

3

In [45]:
nums[0]

1

In [46]:
nums[-1]

3

In [47]:
nums[:2]

[1, 2]

In [48]:
nums[-2:]

[2, 3]

In [49]:
nums

[1, 2, 3]

In [50]:
type(nums[0])

int

In [51]:
type(letters[0])

str

In [52]:
nums.append(4)
letters.append('d')
print(nums,letters)

[1, 2, 3, 4] ['a', 'b', 'c', 'd']


In [53]:
nums = [1,2,3,4]

In [54]:
letters = list('abcd')

In [55]:
nums.append(letters)
nums

[1, 2, 3, 4, ['a', 'b', 'c', 'd']]

In [56]:
nums[4]

['a', 'b', 'c', 'd']

In [57]:
nums[4][2]

'c'

In [58]:
popped = nums.pop()
popped

['a', 'b', 'c', 'd']

In [59]:
nums

[1, 2, 3, 4]

In [60]:
max(nums)

4

In [61]:
min(nums)

1

In [62]:
(' ').join(['two','words'])

'two words'

## Ex 5
Create a function that takes a set of words, either as a list or as a string of words separated by spaces, and returns a list with the lower case of each word, without spaces and punctuation

In [63]:
test = 'A sentence WITH several Words.'

In [64]:
test2 = ['A','sENTEnce','with','several.','words']

In [67]:
def tokenize(d):
    if type(d) != list:
        d = d.split()
    output = []
    for item in d:
        output.append(
            ('').join([letter for letter in list(item.strip().lower()) if letter.isalnum()])
        )

    return output
        

In [69]:
tokenize(test)

['a', 'sentence', 'with', 'several', 'words']

In [70]:
tokenize(test2)

['a', 'sentence', 'with', 'several', 'words']

## Ex 6
Create a function that takes a set of words, either as a list or as a string of words separated by spaces, and returns the input as __one__ string where the words are lower case, punctuation has been removed, and the words are separated by a space

In [71]:
test = 'A sentence WITH several Words.'


def tokenize2(d):
    """Taking input as a document (string) or list of tokens, 
    returning a list of tokens after preprocessing, and a preprocessed string"""
    if type(d) != list:
        d = d.split()
    output = []
    for item in d:
        output.append(('').join([letter for letter in list(item.strip().lower()) if letter.isalnum()]))

    return output, (' ').join(output)

In [72]:
tokens, document = tokenize2(test)
print(tokens)
print(document)

['a', 'sentence', 'with', 'several', 'words']
a sentence with several words


# cross-checking one list vs another

In [73]:
## cross-checking lists

inputs = ['some','words','to','process']
checklist = ['to', 'the', 'a', 'an', 'from']
result = [word for word in inputs if word not in checklist]

#result = []
#for word in inputs:
#    if word not in checklist:
#        result.append(word)
result

['some', 'words', 'process']
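For a long checklist, the same cross-check can be sketched with a set instead of a list. Checking membership in a list scans its items one by one, while a set uses hashing, so each lookup is fast on average. The name checkset is hypothetical and used only in this example.

```python
inputs = ['some', 'words', 'to', 'process']
checklist = ['to', 'the', 'a', 'an', 'from']

checkset = set(checklist)  # convert once, then reuse for every membership test
result = [word for word in inputs if word not in checkset]

print(result)  # ['some', 'words', 'process']
```

The result is the same as with the list; only the speed of each `not in` test changes, which starts to matter when the checklist holds many words.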

# Teaser on what the coming weeks will look further into,  and beyond

In [83]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

stops = stopwords.words('english')
len(stops)

## Ex 7 Create a function to split a string into a list of words (tokens), remove stopwords, and reassemble the result as a string

stringEx6 = """One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin. He lay on his armour-like back, and if he lifted his head a little he could see his brown belly, slightly domed and divided by arches into stiff sections. The bedding was hardly able to cover it and seemed ready to slide off any moment. His many legs, pitifully thin compared with the size of the rest of him, waved about helplessly as he looked. "What's happened to me?" he thought. It wasn't a dream. His room, a proper human room although a little too small, lay peacefully between its four familiar walls. A collection of textile samples lay spread out on the table - Samsa was a travelling salesman - and above it there hung a picture that he had recently cut out of an illustrated magazine and housed in a nice, gilded frame. It showed a lady fitted out with a fur hat and fur boa who sat upright, raising a heavy fur muff that covered the whole of her lower arm towards the viewer. Gregor then turned to look out the window at the dull weather. Drops"""

def tokenize2(d):
    if type(d) != list:
        d = d.split()
    output = []
    for item in d:
        output.append(('').join([letter for letter in list(item.strip().lower()) if letter.isalnum()]))

    return output, (' ').join(output)

def stopRemoval(d):
    output = [word for word in d if word not in stops]
    return output

def preprocessor(d):
    tokens, document = tokenize2(d)
    
    tokens = stopRemoval(tokens)
    
    return (' ').join(tokens)

newPara = preprocessor(stringEx6)
newPara

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/jesselang/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


'one morning gregor samsa woke troubled dreams found transformed bed horrible vermin lay armourlike back lifted head little could see brown belly slightly domed divided arches stiff sections bedding hardly able cover seemed ready slide moment many legs pitifully thin compared size rest waved helplessly looked whats happened thought wasnt dream room proper human room although little small lay peacefully four familiar walls collection textile samples lay spread table  samsa travelling salesman  hung picture recently cut illustrated magazine housed nice gilded frame showed lady fitted fur hat fur boa sat upright raising heavy fur muff covered whole lower arm towards viewer gregor turned look window dull weather drops'

In [84]:
len(newPara)

722

In [85]:
len(stringEx6)

1079