# Python Data Structures and String Manipulation
In this chapter, we'll refresh our knowledge of the main data structures used in Python. We'll cover how to deal with lists, tuples, sets, and dictionaries. We'll also consider strings and how to write regular expressions to retrieve specific character sequences from a given text.
# 1. What are the main data structures in Python?
## 1.1 List methods
Let's imagine a situation: you went to the market and filled your baskets (basket1 and basket2) with fruits. You wanted to have one of each kind but realized that some fruits were put in both baskets.

Task 1. Your first task is to remove everything from basket2 that is already present in basket1.

Task 2. After the removal, one of the baskets might weigh more than the other (all fruit kinds weigh the same). Therefore, the second task is to transfer some fruits from the heavier basket to the lighter one until both hold approximately the same weight/amount of fruits.

In [1]:
basket1 = ['banana', 'kiwifruits', 'grapefruits', 'apples', 'apricots', 'nectarines', 'oranges', 'peaches', 'pears', 'lemons']
basket2 = ['apples', 'grapes', 'apricots', 'dragonfruits', 'peaches', 'pears', 'limes', 'papaya']

### Instructions:
* Remove fruits from basket2 that are already present in basket1.
* Transfer fruits from basket1 to basket2 until the amount in basket2 becomes greater than or equal to the amount in basket1.

In [2]:
# Remove fruits from basket2 that are present in basket1
for item in basket1:
    if item in basket2:
        basket2.remove(item)

print('Basket 1: ' + str(basket1))
print('Basket 2: ' + str(basket2))

Basket 1: ['banana', 'kiwifruits', 'grapefruits', 'apples', 'apricots', 'nectarines', 'oranges', 'peaches', 'pears', 'lemons']
Basket 2: ['grapes', 'dragonfruits', 'limes', 'papaya']


In [3]:
# Transfer fruits from basket1 to basket2
while len(basket1)>len(basket2):
    item_to_transfer = basket1.pop()
    basket2.append(item_to_transfer)

print('Basket 1: ' + str(basket1))
print('Basket 2: ' + str(basket2))

Basket 1: ['banana', 'kiwifruits', 'grapefruits', 'apples', 'apricots', 'nectarines', 'oranges']
Basket 2: ['grapes', 'dragonfruits', 'limes', 'papaya', 'lemons', 'pears', 'peaches']
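
As a hedged alternative to the loops above, the same two steps can be sketched with a list comprehension. Note that `list.remove()` deletes only the first occurrence of a value, whereas the comprehension below drops every duplicate. The helper name `in_basket1` is introduced here for illustration only.

```python
basket1 = ['banana', 'kiwifruits', 'grapefruits', 'apples', 'apricots',
           'nectarines', 'oranges', 'peaches', 'pears', 'lemons']
basket2 = ['apples', 'grapes', 'apricots', 'dragonfruits', 'peaches',
           'pears', 'limes', 'papaya']

# A set makes each membership test cheap (illustrative helper name)
in_basket1 = set(basket1)

# Keep only the fruits of basket2 that are absent from basket1
basket2 = [fruit for fruit in basket2 if fruit not in in_basket1]

# Move fruits from the end of basket1 until basket2 is no longer smaller
while len(basket1) > len(basket2):
    basket2.append(basket1.pop())

print('Basket 1: ' + str(basket1))
print('Basket 2: ' + str(basket2))
```

Building a new list instead of removing items in place also avoids the usual pitfalls of modifying a list while looping over it.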


## 1.2 Operations on sets
Putting the information on sets in more mathematical terms, we can define the following operations given two sets X and Y:

  $X \cap Y$ - the intersection of $X$ and $Y$ (all elements that are in both $X$ and $Y$)

  $X \cup Y$ - the union of $X$ and $Y$ (all elements that are in $X$, in $Y$, or in both)

  $X - Y$ - the difference between $X$ and $Y$ (all elements that are in $X$ but not in $Y$)

You are given five sets of integers: `A`, `B`, `C`, `D`, and `E`. What is the result of the following expression?
$(A \cup (B \cap C)) - (D \cap E)$
### Possible Answers:
1. {2}
2. {}
3. {1, 2}
4. {1, 2, 3, 4, 5, 6, 7}

In [4]:
A = {1, 2, 3, 4, 5, 6, 7}
B = {5, 7, 9, 11, 13, 15}
C = {1, 2, 8, 10, 11, 12, 13, 14, 15, 16, 17}
D = {1, 3, 5, 7, 9, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20}
E = {9, 10, 11, 12, 13, 14, 15}

In [5]:
# '-' binds tighter than '|', so the union needs its own parentheses
(A | (B & C)) - (D & E)

{1, 2, 3, 4, 5, 6, 7}
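
When such expressions are written with operators, precedence matters: Python evaluates `-` before `&`, and `&` before `|`. The small sketch below uses three made-up sets (`X`, `Y`, `Z` are illustrative, not part of the exercise) to show how the result changes without parentheses, and how the method form avoids the question altogether.

```python
# Illustrative sets, chosen so that the two readings differ
X = {1, 2}
Y = {2, 3}
Z = {1}

# Without parentheses this is evaluated as X | (Y - Z)
without_parentheses = X | Y - Z

# Parentheses force the union to be computed first
with_parentheses = (X | Y) - Z

# Chained methods are evaluated left to right, like the parenthesized form
with_methods = X.union(Y).difference(Z)

print(without_parentheses)
print(with_parentheses)
print(with_methods)
```

Here the unparenthesized version keeps the element `1`, while the other two drop it.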

## 1.3 Storing data in a dictionary
The surface you see below is called a circular paraboloid:

<div align="left">
    <img src="_datasets/Circular_Paraboloid_Small.png" alt="Circular Paraboloid">
</div> 

It can be described by the following equation:

$$\frac{x^2}{a^2}+\frac{y^2}{a^2}=z$$

Let's set the coefficient $a$ to $1$. Therefore, the radius at each cut will be equal to $\sqrt{z}$.

Your task is to create a dictionary that stores the mapping from the pair of coordinates $(x,y)$ to the z coordinate (the lists storing considered ranges for $x$ and $y$ are given: `range_x` and `range_y`, respectively).

### Instructions:
* Calculate the value of the $z$ coordinate from the coordinates $x$ and $y$.
* Create a new key for the dictionary `circ_parab` represented as a tuple containing $x$ and $y$.
* Create a new key-value pair for `circ_parab`.

In [6]:
range_x = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0]
range_y = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0]

In [7]:
circ_parab = dict()

for x in range_x:
    for y in range_y:        
        # Calculate the value for z
        z = (x**2+y**2)
        # Create a new key for the dictionary
        key = (x,y)
        # Create a new key-value pair      
        circ_parab[key] = z

### Question:
* What is the value of `circ_parab` for the key `(1.8, 1.4)`?

In [8]:
circ_parab[1.8, 1.4]

5.2
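
The nested loops can also be condensed into a dictionary comprehension. The sketch below is one possible variant, assuming the same `range_x` and `range_y` lists; it uses `itertools.product` to enumerate every $(x, y)$ pair and `math.isclose` to compare floating-point values safely.

```python
from itertools import product
import math

range_x = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0]
range_y = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0]

# Map every (x, y) tuple to z = x^2 + y^2 (coefficient a is set to 1)
circ_parab = {(x, y): x**2 + y**2 for x, y in product(range_x, range_y)}

# 11 x-values times 11 y-values
print(len(circ_parab))

# Compare floats with a tolerance rather than with ==
print(math.isclose(circ_parab[(1.8, 1.4)], 5.2))
```

Because the keys are floats, a lookup succeeds only when it uses exactly the same float values that were stored, as it does here.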

### Question:
Is it possible to use a list instead of a tuple as a key in the `circ_parab` dictionary?

__Answer__:
No. Dictionary keys must be hashable, and a list is mutable and therefore unhashable, so using one as a key raises a `TypeError`.
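
The behavior is easy to check. The following sketch uses a small stand-alone dictionary (`points` and `coords` are illustrative names) rather than `circ_parab` itself.

```python
points = {}

# A tuple is hashable, so it works as a key
points[(1.8, 1.4)] = 5.2

# A list is unhashable, so the assignment raises TypeError
try:
    points[[1.8, 1.4]] = 5.2
except TypeError as error:
    message = str(error)
    print('TypeError:', message)

# Coordinates stored in a list can be converted to a tuple first
coords = [1.8, 1.4]
print(points[tuple(coords)])
```

Converting with `tuple()` is usually the simplest workaround when the coordinates arrive as a list.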

# 2. What are common ways to manipulate strings?
## 2.1 String indexing and concatenation
You are presented with one of the earliest known encryption techniques: the Caesar cipher. It shifts each letter of a message by a fixed number of positions down the given alphabet. For example, with the English alphabet, a shift of 1 turns `'xyz'` into `'yza'`, and decryption reverses it. Notice that `'z'` wraps around to `'a'` in this case.

Thus, encryption/decryption requires two arguments: text and an integer key denoting the shift (`key = 1` for the example above).

Your task is to create an encryption function given the English alphabet stored in the `alphabet` string.

### Instructions:
* Fill in the blanks in the loop to create an encrypted text.
* Check the encryption function with a shift equal to 10 (it should return `'nkdkmkwz'`).

In [9]:
alphabet = "abcdefghijklmnopqrstuvwxyz"

In [10]:
def encrypt(text, key):
  
    result = ''

    # Fill in the blanks to create an encrypted text
    for char in text.lower():
        idx = (alphabet.index(char) + key) % len(alphabet)
        result = result + alphabet[idx]

    return result

# Check the encryption function with a shift equal to 10
print(encrypt("datacamp", 10))

nkdkmkwz
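
Note that `encrypt()` relies on `alphabet.index(char)`, which raises a `ValueError` for any character outside `alphabet`, such as a space. As a hedged alternative, the sketch below builds a translation table with `str.maketrans()`; characters missing from the table are left unchanged. The function name `encrypt_with_table` is hypothetical.

```python
alphabet = "abcdefghijklmnopqrstuvwxyz"

def encrypt_with_table(text, key):
    """Caesar-shift the letters of text; leave other characters as they are."""
    # Reduce the key so that slicing works for any integer shift
    shift = key % len(alphabet)
    # Rotate the alphabet to the left by 'shift' positions
    shifted = alphabet[shift:] + alphabet[:shift]
    # Map each original letter to its shifted counterpart
    table = str.maketrans(alphabet, shifted)
    return text.lower().translate(table)

print(encrypt_with_table("datacamp", 10))
print(encrypt_with_table("data camp!", 10))
```

The table-based version does the same letter shift as the loop, only without a per-character index lookup.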


### Question
Interestingly, the decryption function differs only in the line you filled in inside the for loop. What would be the corresponding change in the `decrypt()` function?

__Possible Answers__:
1. `idx = alphabet.index(char) - key`
2. `idx = (alphabet.index(char) - key) % len(alphabet)`
3. `idx = alphabet.indx(char) + key`
4. `idx = (alphabet.indx(char) * key) % len(alphabet)`

__Answer__:
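Option 1: subtracting is enough as long as `0 <= key <= len(alphabet)`, because Python allows negative indexing. Option 2 gives the same result and also stays in range for keys outside that interval, where option 1 would raise an `IndexError`.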
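
For reference, here is a minimal sketch of a `decrypt()` function built on the modulo form from option 2. It is one possible implementation, not the only valid one; the modulo keeps the index inside the alphabet for any integer key.

```python
alphabet = "abcdefghijklmnopqrstuvwxyz"

def encrypt(text, key):
    result = ''
    for char in text.lower():
        idx = (alphabet.index(char) + key) % len(alphabet)
        result = result + alphabet[idx]
    return result

def decrypt(text, key):
    result = ''
    for char in text.lower():
        # Shift back; % always yields an index between 0 and len(alphabet) - 1
        idx = (alphabet.index(char) - key) % len(alphabet)
        result = result + alphabet[idx]
    return result

# Round trip with the key from the exercise
print(decrypt(encrypt("datacamp", 10), 10))

# Round trip with a key larger than the alphabet
print(decrypt(encrypt("datacamp", 36), 36))
```

Both calls print the original text, since shifting by 36 is equivalent to shifting by 10 in a 26-letter alphabet.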

## 2.2 Operations on strings
You are given the variable `text` storing the following string `'StRing ObJeCts haVe mANy inTEResting pROPerTies'`.

Your task is to modify this string so that it becomes `'string OBJECTS have MANY interesting PROPERTIES'` (every second word in `text`, starting with the second one, is uppercased, and the remaining words are lowercased). You will obtain this result in three steps.

### Instructions:
* Create a word list from the given string.
* Make every second word uppercased and the remaining words lowercased.
* Join the words and form a new string and check the newly created string.

In [11]:
text = 'StRing ObJeCts haVe mANy inTEResting pROPerTies'

In [12]:
# Create a word list from the string stored in 'text'
word_list = text.split()

# Make every other word uppercased; otherwise - lowercased
for i in range(len(word_list)):
    if (i + 1) % 2 == 0:
        word_list[i] = word_list[i].upper()
    else:
        word_list[i] = word_list[i].lower()
        
# Join the words back and form a new string
new_text = ' '.join(word_list)
print(new_text)

string OBJECTS have MANY interesting PROPERTIES
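
The three steps can also be combined into a single expression. The sketch below is an equivalent variant that uses `enumerate()` with `start=1`, so that even positions correspond to the words to be uppercased.

```python
text = 'StRing ObJeCts haVe mANy inTEResting pROPerTies'

# Number the words from 1; uppercase even positions, lowercase odd ones
new_text = ' '.join(
    word.upper() if position % 2 == 0 else word.lower()
    for position, word in enumerate(text.split(), start=1)
)
print(new_text)
```

Unlike the loop, this version does not modify a list in place; it builds the final string directly.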


## 2.3 Fixing string errors in a DataFrame
You are given the `heroes` dataset containing the information on different comic book heroes. However, you'll need to make some refinements in order to use this dataset further.

Comparing the `Eye color`, `Hair color`, and `Skin color` columns, you can see that the strings in the `Hair color` column are capitalized, whereas in the other two the strings are lowercased.

Moreover, some rows in the `Gender` column contain a spelling error (`Fmale` instead of `Female`).

Your task is to make the strings in the `Hair color` column lowercased and to fix the spelling error in the `Gender` column.

### Instructions:
* Make all the values in the `Hair color` column lowercased.
* Substitute all the appearances of `Fmale` with `Female` in the `Gender` column.

In [13]:
import pandas as pd

# Load the data
heroes = pd.read_csv('_datasets/heroes_information.csv', index_col = 0, na_values = ("-", -99))

# Introducing a spelling mistake in the 'Female' category
heroes.loc[(heroes.index % 4 == 0) & (heroes['Gender'] == 'Female'), 'Gender'] = 'Fmale'

In [14]:
# Make all the values in the 'Hair color' column lowercased
heroes['Hair color'] = heroes['Hair color'].str.lower()
  
# Check the values in the 'Hair color' column
print(heroes['Hair color'].value_counts())

black               161
blond               102
brown                87
no hair              75
red                  51
white                23
auburn               13
green                 8
strawberry blond      7
purple                5
grey                  5
brown / white         4
silver                4
blue                  3
yellow                2
orange                2
red / grey            1
magenta               1
indigo                1
pink                  1
red / white           1
brown / black         1
black / blue          1
orange / white        1
gold                  1
red / orange          1
Name: Hair color, dtype: int64


In [15]:
# Substitute 'Fmale' with 'Female' in the 'Gender' column
heroes['Gender'] = heroes['Gender'].str.replace('Fmale', 'Female')

# Check that there are no occurrences of 'Fmale'
print(heroes['Gender'].value_counts())

Male      505
Female    200
Name: Gender, dtype: int64


Note that Series and DataFrames also have their own `.replace()` method, which works on values of any type, not only strings.
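
To illustrate the difference, here is a small sketch on made-up data (the `gender` and `weight` Series are hypothetical and are not taken from the `heroes` dataset). With a scalar argument, `Series.replace()` swaps whole values, whereas `.str.replace()` substitutes substrings inside each string.

```python
import pandas as pd

# Hypothetical toy data, not the heroes dataset
gender = pd.Series(['Male', 'Fmale', 'Female', 'Fmale'])
weight = pd.Series([70, -99, 85])

# Series.replace() swaps whole values that equal 'Fmale'
fixed_gender = gender.replace('Fmale', 'Female')

# It also works on non-string values, e.g. a numeric placeholder
fixed_weight = weight.replace(-99, float('nan'))

# .str.replace() works on substrings and only on strings
same_result = gender.str.replace('Fmale', 'Female', regex=False)

print(fixed_gender.tolist())
print(fixed_weight.isna().tolist())
print(same_result.tolist())
```

For a simple spelling fix like this one, both approaches give the same result; the choice matters once partial matches or non-string values are involved.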

# 3. How to write regular expressions in Python?
## 3.1 Write a regular expression
Let's write some regular expressions!

Your task is to create a regular expression matching a valid temperature represented either in Celsius or Fahrenheit scale (e.g. `'+23.5 C'`, `'-4 F'`, `'0.0 C'`) and to extract all the appearances from the given string `text`. Positive temperatures can be prefixed with `+` or contain no prefix (e.g. `'5 F'`, `'+5 F'`). Negative temperatures must be prefixed with `-`. Zero temperature should not be prefixed with any symbol.

The `re` module is already imported.

Tip: inside square brackets `[]`, the `+` symbol stands for the character itself (e.g. the regular expression `[1a+]` matches `'1'`, `'a'`, or `'+'`).

Instructions:
* Define the pattern to search for valid temperatures
* Create an object storing the matches using `finditer()`.
* Loop over `matches_storage` and print out item properties: the matching sequence, its start and end index.

In [16]:
import re

text = "Let's consider the following temperatures using the Celsius scale: +23 C, 0 C, -20.0 C, -2.2 C, -5.65 C, 0.0001 C.\
        To convert them to the Fahrenheit scale you have multiply the number by 9/5 and add 32 to the result.\
        Therefore, the corresponding temperatures in the Fahrenheit scale will be: \
        +73.4 F, 32 F, -4.0 F, +28.04 F, 21.83 F, +32.00018 F."

In [17]:
# Define the pattern to search for valid temperatures
pattern = re.compile(r'[+-]?\d+\.?\d* [CF]')

# Print the temperatures out
print(re.findall(pattern, text))

['+23 C', '0 C', '-20.0 C', '-2.2 C', '-5.65 C', '0.0001 C', '+73.4 F', '32 F', '-4.0 F', '+28.04 F', '21.83 F', '+32.00018 F']


In [18]:
# Create an object storing the matches using 'finditer()'
matches_storage = re.finditer(pattern, text)

# Loop over matches_storage and print out item properties
for match in matches_storage:
    print('matching sequence = ' + match.group())
    print('start index = ' + str(match.start()))
    print('end index = ' + str(match.end()) + '\n')

matching sequence = +23 C
start index = 67
end index = 72

matching sequence = 0 C
start index = 74
end index = 77

matching sequence = -20.0 C
start index = 79
end index = 86

matching sequence = -2.2 C
start index = 88
end index = 94

matching sequence = -5.65 C
start index = 96
end index = 103

matching sequence = 0.0001 C
start index = 105
end index = 113

matching sequence = +73.4 F
start index = 314
end index = 321

matching sequence = 32 F
start index = 323
end index = 327

matching sequence = -4.0 F
start index = 329
end index = 335

matching sequence = +28.04 F
start index = 337
end index = 345

matching sequence = 21.83 F
start index = 347
end index = 354

matching sequence = +32.00018 F
start index = 356
end index = 367
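
Match objects can also expose parts of a match through groups. The sketch below is a hedged variation on the pattern above: it adds named groups (`value` and `scale` are names chosen here) so that the number and the scale letter can be read separately. The `sample` string is a shortened, made-up text.

```python
import re

# Shortened, made-up sample text
sample = "Readings: +23 C, 0 C, -2.2 C, +73.4 F."

# Named groups separate the numeric value from the scale letter
pattern = re.compile(r'(?P<value>[+-]?\d+(?:\.\d+)?) (?P<scale>[CF])')

# Convert every match into a (number, scale) tuple
readings = [
    (float(match.group('value')), match.group('scale'))
    for match in pattern.finditer(sample)
]
print(readings)
```

Like the original pattern, this one does not enforce the rule that zero must be unsigned; it only extracts the values.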



## 3.2 Find the correct pattern
Which of the following regular expressions will precisely match the long date format (for example, `October 26, 1988` or `Oct 26, 1988` with the first letter capitalized)?

Consider only the non-negative years.

### Possible Answers:
1. `\w+\s[1-3]?\d,\s\d+`
2. `[A-Z][a-z]+\s\d{1,2},\s\d+`
3. `[A-Z][a-z]+\s[1-3]?\d,\s\d+`
4. `[A-Z][a-z]+\s[1-3]?\d,\s\d*`

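__Answer__:
Option 3. The other patterns are less precise:
* `\w+` can also match digits and underscores.
* `\d{1,2}` can match non-existent days such as 98.
* `\d*` can match no digits at all.

Note that option 3 is still not a full date validator: `[1-3]?\d` accepts values such as 39.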
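
These differences can be checked directly. The sketch below tests the four candidate patterns against a few sample strings with `re.fullmatch()`; the sample strings are made up for illustration.

```python
import re

# The four candidate patterns, keyed by their option number
patterns = {
    1: r'\w+\s[1-3]?\d,\s\d+',
    2: r'[A-Z][a-z]+\s\d{1,2},\s\d+',
    3: r'[A-Z][a-z]+\s[1-3]?\d,\s\d+',
    4: r'[A-Z][a-z]+\s[1-3]?\d,\s\d*',
}

# Two valid dates followed by three invalid strings
samples = ['October 26, 1988', 'Oct 26, 1988',
           '123 26, 1988', 'October 98, 1988', 'October 26, ']

# For each sample, list the options whose pattern matches the whole string
accepted = {
    sample: [number for number, pattern in patterns.items()
             if re.fullmatch(pattern, sample)]
    for sample in samples
}

for sample, options in accepted.items():
    print(repr(sample), '->', options)
```

Each invalid sample is accepted by exactly one of the flawed patterns, while option 3 accepts only the two valid dates.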

## 3.3 Splitting by a pattern
You are given a text stored in the `text` variable.

Split the text in such a way that the resulting list has only words or numbers with no blank spaces or punctuation.

Instructions:
* Compile the regular expression.
* Split the text so that only words or numbers are included in the resulting list, and print the result.
* Define a much easier way to extract words or numbers.

In [19]:
text = "Python has 4 main data structures: list, tuple, set, and dictionary."

In [20]:
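# Compile the regular expression: runs of whitespace and punctuation are separators
pattern = re.compile(r'[\s,:.]+')

# Split the text so that only words or numbers are left;
# the filter drops the empty string produced by the trailing period
words = [word for word in re.split(pattern, text) if word]
print(words)

['Python', 'has', '4', 'main', 'data', 'structures', 'list', 'tuple', 'set', 'and', 'dictionary']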


In [21]:
# Define an easier way to extract words or numbers
alt_pattern = re.compile(r'\w+')
print(re.findall(alt_pattern, text))

['Python', 'has', '4', 'main', 'data', 'structures', 'list', 'tuple', 'set', 'and', 'dictionary']


A regular expression task can often be solved in several ways. It pays to look for the simplest one!