# Numpy Exercise

## Instructions
The same instructions from previous exercises apply.

Write the Python program in the cell below the text of the exercise (or create a new one). Always print the final result to the screen to verify the correctness of the exercise. Sometimes we provide important concepts in a code cell below the exercise text; try running it, and modify it if necessary, to make sure you understand the required concepts.

## Submission
The same rules from previous exercises apply.

It is mandatory to **submit the solution for all exercises** (except for those marked as optional) **before the beginning of the next lesson** in the appropriate assignment on iCorsi. To submit:
- Run the entire notebook from scratch (`Kernel -> Restart & Run All`) and ensure that the solutions are as expected;
- Export the notebook in HTML format (`File -> Download as...`) and submit the resulting file.

If you were unable to complete one or more exercises, describe the problem encountered and **still submit the file with the rest of the solutions**.


In [1]:
# Import packages for later use
import pandas as pd
import numpy as np
import os

In [2]:
# Solution
array = np.arange(10, 50)
print(array)

[10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49]


In [3]:
matrice = np.random.rand(5,5)
print(matrice)

[[0.4776683  0.27027942 0.50824538 0.10360465 0.36354893]
 [0.96829041 0.17593027 0.20433795 0.18434451 0.89227986]
 [0.60114759 0.06680908 0.53671689 0.18881901 0.86827668]
 [0.65028693 0.00774496 0.41923919 0.4202204  0.2236351 ]
 [0.09681501 0.6233033  0.0062507  0.82792469 0.83504821]]


In [4]:
# Use distinct names so the built-in min() and max() are not shadowed
min_value = np.min(matrice)
max_value = np.max(matrice)
print(f"Min: {min_value}, Max: {max_value}")

Min: 0.006250700805777831, Max: 0.96829040504559


In [5]:
frame = np.ones((5, 5))
frame[1:-1, 1:-1] = 0
print(frame)

[[1. 1. 1. 1. 1.]
 [1. 0. 0. 0. 1.]
 [1. 0. 0. 0. 1.]
 [1. 0. 0. 0. 1.]
 [1. 1. 1. 1. 1.]]


In [6]:
def zeroFrame(arr, n=2):
    # Pad `arr` with n // 2 zeros on each side (n is expected to be even)
    half = n // 2
    tmp = np.zeros(len(arr) + 2 * half)
    tmp[half:len(tmp) - half] = arr
    return tmp
    
arr = zeroFrame(array)
print(arr)

[ 0. 10. 11. 12. 13. 14. 15. 16. 17. 18. 19. 20. 21. 22. 23. 24. 25. 26.
 27. 28. 29. 30. 31. 32. 33. 34. 35. 36. 37. 38. 39. 40. 41. 42. 43. 44.
 45. 46. 47. 48. 49.  0.]
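
The zero border built by `zeroFrame` above can also be obtained with NumPy's own `np.pad`. The following is a minimal sketch, not part of the original solution; the name `zero_frame_pad` and its `width` parameter are hypothetical and chosen only for illustration.

```python
import numpy as np

def zero_frame_pad(arr, width=1):
    """Return a copy of `arr` with `width` zeros added on every side (hypothetical helper)."""
    # mode="constant" fills the new border cells with constant_values
    return np.pad(np.asarray(arr), pad_width=width, mode="constant", constant_values=0)

padded_1d = zero_frame_pad(np.arange(10, 15))
print(padded_1d.tolist())  # [0, 10, 11, 12, 13, 14, 0]

# The same call also works on 2-D input, adding a border on all four sides
padded_2d = zero_frame_pad(np.ones((2, 2), dtype=int))
print(padded_2d.shape)  # (4, 4)
```

One difference worth noting: `np.pad` keeps the integer dtype of its input, whereas `zeroFrame` returns floats because `np.zeros` creates a float array by default.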


In [8]:
checkerboard = np.zeros((8, 8), dtype=bool)
checkerboard[1::2, ::2] = True
checkerboard[::2, 1::2] = True
print(checkerboard)

[[False  True False  True False  True False  True]
 [ True False  True False  True False  True False]
 [False  True False  True False  True False  True]
 [ True False  True False  True False  True False]
 [False  True False  True False  True False  True]
 [ True False  True False  True False  True False]
 [False  True False  True False  True False  True]
 [ True False  True False  True False  True False]]
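
As a possible alternative to the two slice assignments above, the same pattern can be derived from the parity of the row and column indices. This is only a sketch; the name `checkerboard_alt` is hypothetical.

```python
import numpy as np

# np.indices returns one array of row indices and one of column indices
rows, cols = np.indices((8, 8))

# A cell is True when row + column is odd, which matches the slicing pattern:
# (0, 0) is False, (0, 1) and (1, 0) are True
checkerboard_alt = (rows + cols) % 2 == 1

# Reference built with the slicing approach, for comparison
reference = np.zeros((8, 8), dtype=bool)
reference[1::2, ::2] = True
reference[::2, 1::2] = True

print(np.array_equal(checkerboard_alt, reference))  # True
```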


In [9]:
def negate_between_5_and_8(arr):
    arr = np.array(arr)
    mask = (arr >= 5) & (arr <= 8)
    arr[mask] = -arr[mask]
    return arr
print(np.arange(1,11))
print(negate_between_5_and_8(np.arange(1,11)))

[ 1  2  3  4  5  6  7  8  9 10]
[ 1  2  3  4 -5 -6 -7 -8  9 10]


In [10]:
def find_closest_index(arr, x):
    arr = np.array(arr)
    index = np.argmin(np.abs(arr - x))
    return index

# Example usage
arr = [-1, 3, 6, 9, 5]
closest_index = find_closest_index(arr, 2)
print(closest_index)

1


## Exercise 1

Let's consider the online dating profiles dataset, which we mentioned in class.

Download the zip files from [this link](https://github.com/rudeboybert/JSE_OkCupid), unzip them, and place the files `profiles_revised.csv` and `essays_revised_and_shuffled.csv` in a `data` subdirectory in the current directory. Then, run the following cell, which uses the pandas library (which we will cover in detail later) to parse the CSV.

The current directory, where we expect to find the file, is this:

In [11]:
print(os.path.join(os.getcwd(), "data"))

/Users/lucamazza/School/datascience/data


In [13]:
df = pd.read_csv("data/profiles_revised.csv")
age = df["age"].values
sex = df["sex"].values

dfe = pd.read_csv("data/essays_revised_and_shuffled.csv")
essay = dfe["essay0"].values

### 1.1

Read the [codebook](https://github.com/rudeboybert/JSE_OkCupid/blob/master/okcupid_codebook_revised.txt) for the dataset, focusing on the three columns "age", "sex", and "essay0".

Explain the type of the `age`, `sex`, and `essay` arrays; visualize some elements from them.
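
To explain the type of an array, it helps to look at its `dtype` and at the Python types of its individual elements. The sketch below uses small stand-in arrays (`toy_age`, `toy_sex`, `toy_essay`, all hypothetical) rather than the real dataset; it assumes that the text columns arrive as `object` arrays and that a missing essay shows up as a float `nan`, which is what the printed output of the real arrays suggests.

```python
import numpy as np

# Hypothetical stand-ins for the arrays extracted from the CSV files
toy_age = np.array([22, 36, 37])
toy_sex = np.array(["m", "f", "m"], dtype=object)
toy_essay = np.array(["well hi there.", np.nan, "musician / writer"], dtype=object)

for name, values in [("age", toy_age), ("sex", toy_sex), ("essay", toy_essay)]:
    # Collect the distinct Python types found among the elements
    element_types = sorted({type(v).__name__ for v in values})
    print(f"{name}: dtype={values.dtype}, element types={element_types}")

# Missing essays are not strings, so they need special care before calling len()
is_missing = np.array([not isinstance(e, str) for e in toy_essay])
print("missing essays:", int(is_missing.sum()))  # missing essays: 1
```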

In [14]:
# Solution 1.1
print(age)
print(sex)
print(essay[:5])

[22 36 37 ... 41 26 40]
['m' 'm' 'm' ... 'm' 'm' 'm']
['well hi there. my mantra this year is "<a class="ilink" href=\n"/interests?i=explore">explore</a>." that may mean walking down an\nuntrodden path, trying russian food, visiting another continent, or\njust staring at a snail to see who blinks first. it may be a\nstrange year, but it won\'t be boring.'
 nan
 'musician / writer / programmer from the woods of montana. dance\nmusic enthusiast, secondhand film critic. owner of too many\nguitars.'
 "nothing in this world can take the place of persistence. talent\nwill not; nothing is more common than unsuccessful people with\ntalent. genius will not; unrewarded genius is almost a proverb.\neducation will not; the world is full of educated derelicts.\npersistence and determination alone are omnipotent. -calvin\ncoolidge<br />\n<br />\ni'm a full time visual artist living and working in downtown sf.\ni'm pretty busy these days juggling a few projects and preparing\nfor a solo show in early

### 1.2

Analyze the arrays with the numpy methods you know and answer the following questions:
- What is the average, minimum, and maximum age of the users?
- How many are male, and how many are female?
- Does the average age of males differ significantly from the average age of females?
- How long are their introductions on average?
- Show the longest introduction.
- Can we determine if males tend to write more compared to females?

In [18]:
# Solution 1.2
average_age = np.mean(age)
minimum_age = np.min(age)
maximum_age = np.max(age)
number_of_males = np.sum(sex == 'm')
number_of_females = np.sum(sex == 'f')
average_age_males = np.mean(age[sex == 'm'])
average_age_females = np.mean(age[sex == 'f'])
# Length of each introduction; missing essays (NaN) count as 0 characters
is_text = np.array([isinstance(e, str) for e in essay])
lengths = np.array([len(e) if isinstance(e, str) else 0 for e in essay])
# Average over the introductions that were actually written
average_length_introductions = np.mean(lengths[is_text])
longest_introduction = essay[np.argmax(lengths)]
# Caveat: the file name suggests that the essay rows are shuffled with respect to
# the profiles, in which case this comparison by sex is not meaningful.
males_write_more = np.mean(lengths[sex == 'm']) > np.mean(lengths[sex == 'f'])

print(f"Average age: {average_age}")
print(f"Minimum age: {minimum_age}")
print(f"Maximum age: {maximum_age}")
print(f"Number of males: {number_of_males}")
print(f"Number of females: {number_of_females}")
print(f"Average age of males: {average_age_males}")
print(f"Average age of females: {average_age_females}")
print(f"Average length of introductions: {average_length_introductions} characters")
print(f"Longest introduction: {longest_introduction}")
print(f"Do males write more than females? {'Yes' if males_write_more else 'No'}")

Average age: 32.33540186167551
Minimum age: 17
Maximum age: 111
Number of males: 35829
Number of females: 24117
Average age of males: 32.012950403304586
Average age of females: 32.81444624124062


## Exercise 2

### 2.1
Write a function `primeNumbers(N)` that takes as an (optional) input argument an integer `N` (default value `N=1000`) and returns an array with all prime numbers less than or equal to `N` (neither 0 nor 1 should be considered).

In [19]:
# Solution 2.1
def primeNumbers(N=1000):
    primes = []
    for num in range(2, N + 1):
        is_prime = True
        for i in range(2, int(np.sqrt(num)) + 1):
            if num % i == 0:
                is_prime = False
                break
        if is_prime:
            primes.append(num)
    return np.array(primes)

print(primeNumbers())

[  2   3   5   7  11  13  17  19  23  29  31  37  41  43  47  53  59  61
  67  71  73  79  83  89  97 101 103 107 109 113 127 131 137 139 149 151
 157 163 167 173 179 181 191 193 197 199 211 223 227 229 233 239 241 251
 257 263 269 271 277 281 283 293 307 311 313 317 331 337 347 349 353 359
 367 373 379 383 389 397 401 409 419 421 431 433 439 443 449 457 461 463
 467 479 487 491 499 503 509 521 523 541 547 557 563 569 571 577 587 593
 599 601 607 613 617 619 631 641 643 647 653 659 661 673 677 683 691 701
 709 719 727 733 739 743 751 757 761 769 773 787 797 809 811 821 823 827
 829 839 853 857 859 863 877 881 883 887 907 911 919 929 937 941 947 953
 967 971 977 983 991 997]
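
Since this is a NumPy exercise, the same list of primes could also be produced with a boolean mask instead of nested loops. The following is a minimal sketch of the sieve of Eratosthenes, a different technique from the trial division used above; the name `prime_numbers_sieve` is hypothetical.

```python
import math
import numpy as np

def prime_numbers_sieve(N=1000):
    """Return all primes <= N using a boolean sieve (hypothetical alternative)."""
    if N < 2:
        return np.array([], dtype=int)
    is_prime = np.ones(N + 1, dtype=bool)
    is_prime[:2] = False  # 0 and 1 are not prime
    for p in range(2, math.isqrt(N) + 1):
        if is_prime[p]:
            # Cross out every multiple of p, starting from p * p
            is_prime[p * p::p] = False
    # Indices that are still True are the primes
    return np.flatnonzero(is_prime)

sieve_primes = prime_numbers_sieve(30)
print(sieve_primes.tolist())  # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```

The slice assignment `is_prime[p * p::p] = False` removes all multiples of `p` in one vectorized step, so the only Python-level loop runs up to the square root of `N`.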


### 2.2
Write a function `createMyDictionary(N)` that takes as an (optional) input argument an integer `N` (default value `N=1000`) and returns a dictionary with two elements. The first element has the key "middle", associated with an array of all prime numbers between $\frac{1}{4}N$ and $\frac{3}{4}N$ (inclusive). The second element has the key "extremes" and contains the remaining prime numbers less than $N$. Hint: Use the previously created function `primeNumbers(N)`.

In [20]:
# Hint

# Remember that the element-wise logical AND (or OR) between two Boolean arrays
# (or two Boolean values) is performed with the operator `&` (or `|` respectively).
# Wrap each comparison in parentheses, because `&` and `|` bind more tightly than `>` or `==`.

array1 = np.arange(10)
print(f"array1: {array1}")

mask1 = (array1 > 2) & (array1 < 8)
print("mask1: ", mask1)

mask2 = (array1[0:3] == 2) | (array1[0:3] == 1)
print("mask2: ", mask2)

array1: [0 1 2 3 4 5 6 7 8 9]
mask1:  [False False False  True  True  True  True  True False False]
mask2:  [False  True  True]


In [21]:
# Solution 2.2
def createMyDictionary(N=1000):
    primes = primeNumbers(N)
    prime_counts = [0] * 10
    for prime in primes:
        for digit in str(prime):
            prime_counts[int(digit)] += 1
    return {
        'prime_numbers': primes,
        'prime_counts': np.array(prime_counts)
    }

print(createMyDictionary())

{'prime_numbers': array([  2,   3,   5,   7,  11,  13,  17,  19,  23,  29,  31,  37,  41,
        43,  47,  53,  59,  61,  67,  71,  73,  79,  83,  89,  97, 101,
       103, 107, 109, 113, 127, 131, 137, 139, 149, 151, 157, 163, 167,
       173, 179, 181, 191, 193, 197, 199, 211, 223, 227, 229, 233, 239,
       241, 251, 257, 263, 269, 271, 277, 281, 283, 293, 307, 311, 313,
       317, 331, 337, 347, 349, 353, 359, 367, 373, 379, 383, 389, 397,
       401, 409, 419, 421, 431, 433, 439, 443, 449, 457, 461, 463, 467,
       479, 487, 491, 499, 503, 509, 521, 523, 541, 547, 557, 563, 569,
       571, 577, 587, 593, 599, 601, 607, 613, 617, 619, 631, 641, 643,
       647, 653, 659, 661, 673, 677, 683, 691, 701, 709, 719, 727, 733,
       739, 743, 751, 757, 761, 769, 773, 787, 797, 809, 811, 821, 823,
       827, 829, 839, 853, 857, 859, 863, 877, 881, 883, 887, 907, 911,
       919, 929, 937, 941, 947, 953, 967, 971, 977, 983, 991, 997]), 'prime_counts': array([15, 78, 32, 75, 34, 33, 33