# Other data types, files and Python packages

## List Comprehensions

`for` loops that build a list can be written more compactly as *list comprehensions*:

`[i for i in x]`

In [None]:
input_list = []
for number in range(1, 10): # from 1 (including) through 10 (excluding)
    input_list.append(number ** 2)
print(input_list)

In [None]:
new = [x ** 2 for x in range(1, 10)]
new

List comprehensions can also include a condition:

`[i for i in x if condition]`

In [1]:
test = [str(x ** 2) for x in range(1, 10) if x % 2 != 0]
test
print(test)

['1', '9', '25', '49', '81']


With a list comprehension and the string method `join()`, several strings can be combined into a single string:

In [None]:
many_strings = ['#spam', '#bacon', '#eggs']
one_string = ', '.join([s.upper() for s
                        in many_strings])
print(one_string)

## Data type: [Tuple](https://docs.python.org/3.3/tutorial/datastructures.html#tuples-and-sequences)

Tuples are sequences of arbitrary objects, opened with `(` and closed with `)`. They support the same read operations as lists (indexing, slicing, iteration) but are *immutable*: their elements cannot be changed after creation.

In [2]:
tupel = (2, 3)
print(tupel[0])
type(tupel)

2


tuple
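
A minimal sketch of what immutability means in practice. The names `point` and `as_list` are chosen only for this illustration: item assignment works on a list but raises a `TypeError` on a tuple.

```python
point = (2, 3)          # hypothetical example tuple
as_list = list(point)   # a mutable list holding the same values
as_list[0] = 99         # works: lists support item assignment

try:
    point[0] = 99       # fails: tuples do not support item assignment
except TypeError as error:
    error_message = str(error)
    print(error_message)

print(point)            # the tuple is unchanged
print(as_list)
```

If the values need to change, converting to a list with `list()` (and back with `tuple()`) is one possible workaround.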

### Iterating Lists-of-Tuples

The `zip()` function makes it possible to iterate over several sequences of equal length in parallel. It pairs up the elements position by position; wrapping the result in `list()` yields a list of tuples.

In [3]:
L1 = [1988, 1964, 1912, 1996, 2008, 1976]
L2 = ['Seoul', 'Tokio', 'Stockholm', 'Atlanta', 'Peking', 'Montreal']


lot = list(zip(L1, L2)) # generates a list of tuples of the zipped object
print(lot)

[(1988, 'Seoul'), (1964, 'Tokio'), (1912, 'Stockholm'), (1996, 'Atlanta'), (2008, 'Peking'), (1976, 'Montreal')]
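
The zipped object does not have to be converted into a list first. As a small sketch (with shortened copies of the lists above under the assumed names `years` and `cities`), a `for` loop can iterate over `zip()` directly and unpack each tuple into two variables:

```python
years = [1988, 1964, 1912]                 # shortened copies of the lists above
cities = ['Seoul', 'Tokio', 'Stockholm']

lines = []
for year, city in zip(years, cities):      # each tuple is unpacked into two names
    lines.append(f'{city} ({year})')
print(lines)
```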


When sorting tuples, Python compares the first elements by default and only looks at later elements to break ties:

In [6]:
print(lot)

[(1988, 'Seoul'), (1964, 'Tokio'), (1912, 'Stockholm'), (1996, 'Atlanta'), (2008, 'Peking'), (1976, 'Montreal')]


In [5]:
print(sorted(lot))

[(1912, 'Stockholm'), (1964, 'Tokio'), (1976, 'Montreal'), (1988, 'Seoul'), (1996, 'Atlanta'), (2008, 'Peking')]
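
To sort by something other than the first element, `sorted()` accepts a `key` function. The following sketch uses a shortened, hypothetical list `pairs` and sorts it by city name and, separately, in descending order by year:

```python
pairs = [(1988, 'Seoul'), (1964, 'Tokio'), (1912, 'Stockholm')]

by_city = sorted(pairs, key=lambda pair: pair[1])   # sort by the second element
newest_first = sorted(pairs, reverse=True)          # default order, reversed
print(by_city)
print(newest_first)
```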


In [7]:
type(lot[1][1][1]) # which element's type is queried here?

str

In [11]:
lot[1][1][1]

'o'

## Data type: [Dictionaries](https://docs.python.org/3.3/tutorial/datastructures.html#dictionaries)

- Next to lists, dictionaries are the most important data containers in Python, not least because the widely used web format JSON can be converted directly into dictionaries.
- Dictionaries are collections of elements that are accessed through a unique key rather than by position (since Python 3.7 they keep their insertion order). Dictionaries are opened with `{` and closed with `}`. Each `key-value` pair consists of a key and its corresponding value, separated by `:`.

`D = {key1: value1, key2: value2, ...}`

In [12]:
D = {'1988': 'Seoul', '2008': ['Peking', 5], 1988: 'Seoul'}
print(D)
type(D)

{'1988': 'Seoul', '2008': ['Peking', 5], 1988: 'Seoul'}


dict

In [19]:
D['2008'][1]

5

In [None]:
print(D['2008'])

In [20]:
D[1988] = 'new value'
print(D)

{'1988': 'Seoul', '2008': ['Peking', 5], 1988: 'new value'}


In [21]:
for key in D.keys():
    print(key, D[key])

1988 Seoul
2008 ['Peking', 5]
1988 new value
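
Two further dictionary methods are often handy when looping or looking up keys. This is a small sketch with a hypothetical dictionary `games`: `.items()` yields key and value together, and `.get()` returns a default instead of raising a `KeyError` for a missing key.

```python
games = {'1988': 'Seoul', 1988: 'new value'}    # string key and integer key are distinct

pairs = [(key, value) for key, value in games.items()]
print(pairs)

has_2008 = '2008' in games                      # membership test on the keys
fallback = games.get('2008', 'unknown')         # default instead of a KeyError
print(has_2008, fallback)
```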


Just like lists, dictionaries can be nested over multiple hierarchical levels:

In [22]:
nested = {'structured': {'spreadsheets': 'flat', 'json': 'tree'},
          'unstructured':  [{'type': 'natural language', 'format': 'txt'}, 
                              {'type' : 'image', 'format': 'png'}]}

In [None]:
print(nested['structured']['json'])

In [None]:
print(nested['unstructured'][0])    

## Exercise 1

Write Python code (e.g. using a `for` loop) that returns the value stored under the key `type` for every element of `unstructured` in the dictionary `nested` above. This refers to the dictionary key `type`, not the built-in function `type()`.

In [None]:
# Code for Exercise 1

## Exercise 2

Create a dictionary whose `keys` are the IDs of the Tweets below. The `value` for each key should be **a list** of all hashtag strings of that Tweet. *Bonus: Define a function to solve this exercise.*

https://twitter.com/BMBF_Bund/status/1199976385621692416    
https://twitter.com/WHO/status/1257937948424757248  

Solution example:
```
{'1199976385621692416': ['#openaccess', '#BMBF', '#Wissenschaft', '#OA', '#podcast'],
 '1257937948424757248': ['#COVID19', '#HealthForAll', '#coronavirus']}
```

In [None]:
tweets = [
      { # tweet 1
        'created_at': 'Thu Nov 28 09:00:39 +0000 2019',
        'favorite_count': 283,
        'full_text': 'Reden wir offen ... über #openaccess! \nUnd zwar in unserer Podcast-Mini-Serie. In der ersten Folge spricht Radiomoderator Holger Klein mit der Chemikerin Mai Thi Nguyen-Kim @maithi_nk. Jetzt hier reinhören 🎧: https://t.co/aRRo23c0t0\n@wrint_de #BMBF #Wissenschaft #OA #podcast https://t.co/EzyqBc4WAK',
        'hashtags': [{
            'text': 'openaccess'
          },
          {
            'text': 'BMBF'
          },
          {
            'text': 'Wissenschaft'
          },
          {
            'text': 'OA'
          },
          {
            'text': 'podcast'
          }
        ],
        'id': 1199976385621692416, # id 1
        'id_str': '1199976385621692416',
        'lang': 'de',
        'media': [{
          'display_url': 'pic.twitter.com/EzyqBc4WAK',
          'expanded_url': 'https://twitter.com/BMBF_Bund/status/1199976385621692416/photo/1',
          'id': 1199976381893005314,
          'media_url': 'http://pbs.twimg.com/media/EKcsKV2XUAIvNj6.jpg',
          'media_url_https': 'https://pbs.twimg.com/media/EKcsKV2XUAIvNj6.jpg',
          'sizes': {
            'small': {
              'w': 680,
              'h': 453,
              'resize': 'fit'
            },
            'thumb': {
              'w': 150,
              'h': 150,
              'resize': 'crop'
            },
            'medium': {
              'w': 1200,
              'h': 799,
              'resize': 'fit'
            },
            'large': {
              'w': 2048,
              'h': 1363,
              'resize': 'fit'
            }
          },
          'type': 'photo',
          'url': 'https://t.co/EzyqBc4WAK'
        }],
        'retweet_count': 73,
        'source': '<a href="https://www.hootsuite.com" rel="nofollow">Hootsuite Inc.</a>',
        'urls': [{
          'expanded_url': 'http://ow.ly/xymW50xmNGj',
          'url': 'https://t.co/aRRo23c0t0'
        }],
        'user': {
          'created_at': 'Fri Jan 09 13:29:16 +0000 2015',
          'default_profile': True,
          'description': 'Hier twittert die Social-Media-Redaktion des Bundesministeriums für Bildung und Forschung - https://t.co/UqNZkLfa5s',
          'favourites_count': 3803,
          'followers_count': 48480,
          'friends_count': 650,
          'geo_enabled': True,
          'id': 2969727718,
          'id_str': '2969727718',
          'listed_count': 596,
          'location': 'Berlin',
          'name': 'BMBF',
          'screen_name': 'BMBF_Bund',
          'statuses_count': 9852,
          'url': 'https://t.co/swEZR6KkpA',
          'verified': True
        },
        'user_mentions': [{
            'id': 1094849342,
            'id_str': '1094849342',
            'name': 'Mai Thi Nguyen-Kim',
            'screen_name': 'maithi_nk'
          },
          {
            'id': 303342452,
            'id_str': '303342452',
            'name': 'WRINT',
            'screen_name': 'wrint_de'
          }
        ]
      },

      { # tweet 2
        'created_at': 'Wed May 06 07:39:12 +0000 2020',
        'favorite_count': 750,
        'full_text': 'In a little over 3 months, #COVID19 has changed the world in so many ways, bringing us closer together and reaffirming the importance of #HealthForAll.\nThis video shows the key moments so far as WHO works with partners worldwide to fight #coronavirus and save lives. https://t.co/oYQV4DPbxa',
        'hashtags': [{
            'text': 'COVID19'
          },
          {
            'text': 'HealthForAll'
          },
          {
            'text': 'coronavirus'
          }
        ],
        'id': 1257937948424757248, # id 2
        'id_str': '1257937948424757248',
        'lang': 'en',
        'media': [{
          'display_url': 'pic.twitter.com/oYQV4DPbxa',
          'expanded_url': 'https://twitter.com/WHO/status/1257937948424757248/video/1',
          'id': 1257937273561190400,
          'media_url': 'http://pbs.twimg.com/amplify_video_thumb/1257937273561190400/img/JvLTmXi97KVXwviA.jpg',
          'media_url_https': 'https://pbs.twimg.com/amplify_video_thumb/1257937273561190400/img/JvLTmXi97KVXwviA.jpg',
          'sizes': {
            'thumb': {
              'w': 150,
              'h': 150,
              'resize': 'crop'
            },
            'medium': {
              'w': 1200,
              'h': 675,
              'resize': 'fit'
            },
            'small': {
              'w': 680,
              'h': 383,
              'resize': 'fit'
            },
            'large': {
              'w': 1280,
              'h': 720,
              'resize': 'fit'
            }
          },
          'type': 'video',
          'url': 'https://t.co/oYQV4DPbxa',
          'video_info': {
            'aspect_ratio': [16, 9],
            'duration_millis': 325600,
            'variants': [{
                'bitrate': 288000,
                'content_type': 'video/mp4',
                'url': 'https://video.twimg.com/amplify_video/1257937273561190400/vid/480x270/Qp6JiURxItBa9H6c.mp4?tag=13'
              },
              {
                'bitrate': 832000,
                'content_type': 'video/mp4',
                'url': 'https://video.twimg.com/amplify_video/1257937273561190400/vid/640x360/nqIWSKAqarkV9xRw.mp4?tag=13'
              },
              {
                'content_type': 'application/x-mpegURL',
                'url': 'https://video.twimg.com/amplify_video/1257937273561190400/pl/JUsDgAz2USLHj8JK.m3u8?tag=13'
              },
              {
                'bitrate': 2176000,
                'content_type': 'video/mp4',
                'url': 'https://video.twimg.com/amplify_video/1257937273561190400/vid/1280x720/frWD4cpJVdQsBDDH.mp4?tag=13'
              }
            ]
          }
        }],
        'retweet_count': 324,
        'source': '<a href="https://studio.twitter.com" rel="nofollow">Twitter Media Studio</a>',
        'urls': [],
        'user': {
          'created_at': 'Wed Apr 23 19:56:27 +0000 2008',
          'description': 'We are the #UnitedNations’ health agency. We are committed to achieving better health for everyone, everywhere - #HealthForAll',
          'favourites_count': 10673,
          'followers_count': 7692786,
          'friends_count': 1719,
          'geo_enabled': True,
          'id': 14499829,
          'id_str': '14499829',
          'listed_count': 32284,
          'location': 'Geneva, Switzerland',
          'name': 'World Health Organization (WHO)',
          'screen_name': 'WHO',
          'statuses_count': 50956,
          'url': 'https://t.co/wVulKuROWG',
          'verified': True
        },
        'user_mentions': []
      }]

In [None]:
# Code Exercise 2

## Files (I/O)

File methods (I/O) are used to read data into Python and/or to store output after processing. The most important I/O methods are:

* `.write()`: writes a string to a file.
* `.read()`: reads the content of a file object and returns it as a string.

To read or write data, the function `open()` creates a file object. The first argument is the file name, the second the mode (`r` for read, `w` for write), and the keyword argument `encoding` sets the text encoding (e.g. `utf-8`).

Change working directory (NEEDS TO BE MODIFIED):

In [None]:
%cd "FILE PATH"

Create example list:

In [None]:
towrite = ['Encoding', 'makes', 'trouble', '€ncôdíng_mäkes_trouble!']

In [None]:
with open('myfile.txt', 'w', encoding = 'utf-8') as f:
    for word in towrite:
        f.write(word + '\n')

In [None]:
with open('myfile.txt', 'r', encoding = 'utf-8') as f:
    toread = f.read().split('\n')[:-1] # split into lines; [:-1] drops the empty string left after the final newline
print(toread)
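
As an alternative to `read().split('\n')[:-1]`, a file object can be iterated line by line, or the text can be split with `splitlines()`. The sketch below is self-contained and writes to a temporary folder (via the standard-library module `tempfile`) so that it does not depend on the working directory; the names `words`, `lines` and `same_lines` are illustrative.

```python
import os
import tempfile

words = ['Encoding', 'makes', 'trouble']            # shortened example list

with tempfile.TemporaryDirectory() as folder:       # temporary folder, removed afterwards
    path = os.path.join(folder, 'myfile.txt')
    with open(path, 'w', encoding='utf-8') as f:
        f.write('\n'.join(words) + '\n')
    with open(path, 'r', encoding='utf-8') as f:
        lines = [line.rstrip('\n') for line in f]   # file objects yield one line at a time
    with open(path, 'r', encoding='utf-8') as f:
        same_lines = f.read().splitlines()          # no empty last element to slice off

print(lines)
print(lines == same_lines)
```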

### Encoding texts

Computers store text as numbers. For each character, e.g. `a`, there is a number by which that character is [encoded](https://de.wikipedia.org/wiki/Zeichenkodierung).

In [None]:
from IPython.display import IFrame
IFrame("https://www.asciitable.com/", width = "800", height = "400")
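
The mapping between characters and numbers can also be inspected directly in Python. As a brief illustration, the built-in functions `ord()` and `chr()` convert between a character and its code point, and the string method `encode()` shows the bytes that a given encoding produces:

```python
code_point = ord('a')                     # number assigned to the character 'a'
character = chr(97)                       # and back again
print(code_point, character)

euro_utf8 = list('€'.encode('utf-8'))     # '€' needs three bytes in UTF-8
a_umlaut_utf8 = list('ä'.encode('utf-8'))
a_umlaut_latin1 = list('ä'.encode('latin-1'))
print(euro_utf8)
print(a_umlaut_utf8, a_umlaut_latin1)     # same character, different bytes
```

The last line hints at why the encoding matters: the same character is stored as different byte values depending on the encoding.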

If a program is told to read a file with the wrong encoding, some characters are decoded incorrectly:

In [None]:
with open('myfile.txt', 'r', encoding = 'latin-1') as f:
    toread = f.read().split('\n') # split into lines (without [:-1], the last element is an empty string)
print(toread)

Use `utf-8` as standard. If you don't know the original encoding when reading in data, you may have to test different encodings.
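
Testing different encodings could look roughly like the following sketch. The helper `read_with_fallback` and the file name `legacy.txt` are hypothetical and only serve as an illustration; the example writes its own `latin-1` file into a temporary folder so that it runs on its own.

```python
import os
import tempfile

def read_with_fallback(path, encodings=('utf-8', 'latin-1')):
    """Hypothetical helper: return (text, encoding) for the first encoding that decodes."""
    for encoding in encodings:
        try:
            with open(path, 'r', encoding=encoding) as f:
                return f.read(), encoding
        except UnicodeDecodeError:
            continue  # this candidate failed, try the next one
    raise ValueError('none of the candidate encodings could decode the file')

# Demonstration with a temporary file that was written as latin-1
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'legacy.txt')
    with open(path, 'w', encoding='latin-1') as f:
        f.write('mäkes trouble')
    text, used = read_with_fallback(path)

print(text, used)
```

Note that `latin-1` can decode any byte sequence without raising an error, so it should come last in the list, and a successful decode does not prove that the encoding is the right one. The result still needs a visual check.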

## Python packages

There are thousands of additional packages that provide ready-made solutions for particular problems. For data scraping in particular, we will rely on libraries that are not part of the Python standard library.

Libraries are made available through the `import` statement. You can import either the entire package or only a particular object (e.g. a function or class) from it.

Examples: the module [time](https://docs.python.org/3/library/time.html) and the class `Counter` from [collections](https://docs.python.org/3/library/collections.html).

In [None]:
import time
from collections import Counter

The imported names are now available in the `namespace` and can be used:

In [None]:
print(time.localtime())

In [None]:
to_count = [1, 2, 3, 1, 7, '2', 4, 6, 9, 2, 7]
count_dic = Counter(to_count)
print(count_dic)
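
A `Counter` behaves like a dictionary with a few extras. The sketch below reuses the values from the cell above under the assumed name `counts` and shows that the integer `2` and the string `'2'` are counted separately, that missing elements count as `0`, and that `most_common()` returns the most frequent elements:

```python
from collections import Counter

counts = Counter([1, 2, 3, 1, 7, '2', 4, 6, 9, 2, 7])
print(counts[2])      # occurrences of the integer 2
print(counts['2'])    # occurrences of the string '2'
print(counts[5])      # a missing element counts as 0 instead of raising a KeyError

top_three = counts.most_common(3)   # list of (element, count) tuples
print(top_three)
```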

### JSON

The Python package [json](https://docs.python.org/3/library/json.html) reads and writes data in the [JavaScript Object Notation](https://de.wikipedia.org/wiki/JavaScript_Object_Notation) format, `JSON` for short. The most important functions are `load()` and `dump()`:

In [None]:
import json
nested = {'structured': {'spreadsheets': 'flat', 'json': 'tree'},
          'unstructured':  [{'type': 'natural language', 'format': 'txt'}, 
                              {'type' : 'image', 'format': 'png'}]}

with open('example_dict.json', 'w', encoding = 'utf-8') as f:
    json.dump(nested, # the to-be-stored data object
              f,      # the opened file object
             ensure_ascii = False, # UTF-8 compatibility
             indent = 2) # optional: indenting of nested data structures

In [None]:
with open('example_dict.json', 'r', encoding = 'utf-8') as f:
    nested2 = json.load(f)
    
nested == nested2
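
The round trip above returns an equal object, but that is not guaranteed for every dictionary. The following sketch uses the string variants `json.dumps()` and `json.loads()` and a hypothetical dictionary `record` to show two conversions worth knowing: JSON keys are always strings, and tuples come back as lists.

```python
import json

record = {1988: 'Seoul', 'coords': (37.5, 127.0)}   # hypothetical example data
as_text = json.dumps(record)                        # dictionary -> JSON string
restored = json.loads(as_text)                      # JSON string -> dictionary

print(as_text)
print(restored)
print(restored == record)   # the integer key and the tuple did not survive unchanged
```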

In [None]:
from pprint import pprint # readable output
pprint(nested, 
       indent = 5) # indenting output

## Exercise 3

Store both Tweets from Exercise 2 using Python I/O and JSON functions locally in your working directory.

In [None]:
# Code for exercise 3

<br>
<br>


___

                
**Contact: Gerome Wolf** (Email: wolfgerome@gmail.com)