------------
#### Reading files into NumPy
--------------

- Scientific data can come in a variety of file formats and types.
- This section imports data into NumPy arrays from two commonly used text file formats for scientific data:

    - Plain text files (.txt)
    - Comma-separated values files (.csv)
    
- `Plain Text Files`
    - Plain text files simply list out the values on separate lines without any symbols or delimiters to indicate separate values.

    - For example, average monthly precipitation (inches) for Boulder, CO, collected by the U.S. National Oceanic and Atmospheric Administration (NOAA), can be stored as a plain text file (.txt), with a separate line for each month’s value.

In [4]:
import numpy as np
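
As a minimal sketch of reading such a one-value-per-line file, `np.loadtxt` can be pointed at any file-like object. The three values below are placeholders for illustration only (not the NOAA figures), and `StringIO` stands in for a real `.txt` file on disk.

```python
from io import StringIO

import numpy as np

# Placeholder values, one per line; these are NOT the NOAA figures.
plain_text = "0.70\n0.75\n1.85\n"

# StringIO stands in for an open file. A path to a real .txt file
# could be passed to np.loadtxt in the same position.
precip = np.loadtxt(StringIO(plain_text))

print(precip.ndim)    # one value per line gives a 1-D array
print(precip)
```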

- `CSV Files`
    - Unlike plain text files, files containing comma-separated values (.csv) use commas (or another delimiter, such as tabs or semicolons) to separate values.

    - This means that .csv files can easily support multiple rows and columns of related data.

For example, the monthly precipitation values for Boulder, CO for the years 2002 and 2013 can be stored together in a comma-separated values (.csv) file, with each year of data on a separate line and each month of data within a specific year separated by commas:

1.07, 0.44, 1.50, 0.20, 3.20, 1.18, 0.09, 1.44, 1.52, 2.44, 0.78, 0.02

0.27, 1.13, 1.72, 4.14, 2.66, 0.61, 1.03, 1.40, 18.16, 2.24, 0.29, 0.5
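
The two lines above can be loaded into a 2-D array roughly as follows. This is a sketch that feeds the text through `StringIO` rather than a file on disk; the variable names are illustrative.

```python
from io import StringIO

import numpy as np

# The two rows shown above: one line per year, twelve monthly values each.
csv_text = (
    "1.07, 0.44, 1.50, 0.20, 3.20, 1.18, 0.09, 1.44, 1.52, 2.44, 0.78, 0.02\n"
    "0.27, 1.13, 1.72, 4.14, 2.66, 0.61, 1.03, 1.40, 18.16, 2.24, 0.29, 0.5\n"
)

# delimiter="," tells genfromtxt to split each line on commas.
precip_by_year = np.genfromtxt(StringIO(csv_text), delimiter=",")

print(precip_by_year.shape)   # rows are years, columns are months
print(precip_by_year[1, 8])   # second row, ninth month
```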

In [5]:
from io import StringIO

- `StringIO`, a class in the `io` module, is an in-memory file-like object.

- It can be used as input or output for most functions that expect a standard file object.

- A `StringIO` object can be initialized by passing a string to the constructor.

- If no string is passed, the `StringIO` object starts empty. In both cases, the cursor starts at position zero.
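
The cursor behaviour described above can be checked with a short sketch. The strings used here are arbitrary examples.

```python
from io import StringIO

# Initialized with a string: the cursor still starts at position 0.
buf = StringIO("first line\nsecond line\n")
start = buf.tell()
first = buf.readline()      # reading advances the cursor past the first line

# seek(0) rewinds so the whole content can be read again.
buf.seek(0)
everything = buf.read()

# Created without a string: the buffer starts empty, cursor at 0.
empty = StringIO()
empty_start = empty.tell()
empty.write("hello")
written = empty.getvalue()  # getvalue() returns the content regardless of cursor

print(start, repr(first), empty_start, written)
```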

#### Example

In [162]:
x = StringIO("1 2 3\n 4 5 6")

x_arr = np.genfromtxt(x, names="A, B, C")
x_arr

array([(1., 2., 3.), (4., 5., 6.)],
      dtype=[('A', '<f8'), ('B', '<f8'), ('C', '<f8')])

In [161]:
x_arr['A'], x_arr['B']

(array([1., 4.]), array([2., 5.]))
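
When the text carries its own header row, a possible variation is to pass `names=True` so that `genfromtxt` takes the field names from the first line. The sketch below assumes a header line `A B C` has been added to the same data.

```python
from io import StringIO

import numpy as np

# Same values as above, with an assumed header row added as the first line.
text = StringIO("A B C\n1 2 3\n4 5 6")

# names=True reads the field names from the first line of the input.
arr = np.genfromtxt(text, names=True)

print(arr.dtype.names)   # field names taken from the header
print(arr["C"])          # one column, selected by field name
```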

#### Another example

In [79]:
names = ['John', 'David', 'Camel', 'Sia', 'Rakesh']

In [91]:
out = StringIO()

In [92]:
# Write each index and name into the in-memory buffer.
for i in range(0, 5):
    out.write(str(i))
    str_to_write = ' My name is : '+ names[i]+'\n '
    out.write(str_to_write)

In [93]:
out.getvalue()

'0 My name is : John\n 1 My name is : David\n 2 My name is : Camel\n 3 My name is : Sia\n 4 My name is : Rakesh\n '

In [89]:
out.close()
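
Once `close()` has been called, the buffer's contents are no longer available, so `getvalue()` has to come first. One way to make that ordering explicit is a `with` block, sketched here with the same names list (the line format is simplified slightly).

```python
from io import StringIO

names = ['John', 'David', 'Camel', 'Sia', 'Rakesh']

# The with block closes the buffer automatically on exit,
# so the text is copied out with getvalue() while it is still open.
with StringIO() as out:
    for i, name in enumerate(names):
        out.write(f"{i} My name is : {name}\n")
    text = out.getvalue()

print(text)
print(out.closed)   # the buffer is closed after the with block
```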

#### Another Example

In [164]:
location = r'D:\MYLEARN\DATASETS\crime_data.csv'

sample data (below)

In [165]:
dtypes=[('area',       "U30"),
        ('Murder' ,     float),
        ('Assault',     float),
        ('UrbanPop',    float),
        ('other_crime', float)]

In [166]:
data = np.genfromtxt(location, delimiter=",", skip_header=1, dtype=dtypes)

In [167]:
type(data), data.shape

(numpy.ndarray, (50,))
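
Because the path above is specific to one machine, here is a self-contained sketch of the same call on three rows copied from the output shown further down. The header line is an assumption (the real file's header is not shown), and `encoding="utf-8"` is added only to keep string handling consistent across NumPy versions.

```python
from io import StringIO

import numpy as np

# Assumed header plus three rows taken from the array output in this section.
csv_text = (
    "area,Murder,Assault,UrbanPop,other_crime\n"
    "Alabama,4,13.2,236,58\n"
    "Alaska,4,10,263,48\n"
    "Arizona,4,8.1,294,80\n"
)

dtypes = [('area', "U30"),
          ('Murder', float),
          ('Assault', float),
          ('UrbanPop', float),
          ('other_crime', float)]

# skip_header=1 drops the header line; dtype gives each column a name and type.
sample = np.genfromtxt(StringIO(csv_text), delimiter=",", skip_header=1,
                       dtype=dtypes, encoding="utf-8")

print(sample.shape)      # one record per row, so the array is 1-D
print(sample['area'])
```

A structured array like this is one-dimensional: each element is a record, and columns are selected by field name rather than by a second index.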

In [169]:
data

array([('Alabama', 4., 13.2, 236., 58.), ('Alaska', 4., 10. , 263., 48.),
       ('Arizona', 4.,  8.1, 294., 80.),
       ('Arkansas', 3.,  8.8, 190., 50.),
       ('California', 4.,  9. , 276., 91.),
       ('Colorado', 3.,  7.9, 204., 78.),
       ('Connecticut', 2.,  3.3, 110., 77.),
       ('Delaware', 4.,  5.9, 238., 72.),
       ('Florida', 4., 15.4, 335., 80.), ('Georgia', 3., 17.4, 211., 60.),
       ('Hawaii', 1.,  5.3,  46., 83.), ('Idaho', 2.,  2.6, 120., 54.),
       ('Illinois', 4., 10.4, 249., 83.),
       ('Indiana', 2.,  7.2, 113., 65.), ('Iowa', 1.,  2.2,  56., 57.),
       ('Kansas', 2.,  6. , 115., 66.), ('Kentucky', 2.,  9.7, 109., 52.),
       ('Louisiana', 4., 15.4, 249., 66.), ('Maine', 1.,  2.1,  83., 51.),
       ('Maryland', 4., 11.3, 300., 67.),
       ('Massachusetts', 3.,  4.4, 149., 85.),
       ('Michigan', 4., 12.1, 255., 74.),
       ('Minnesota', 1.,  2.7,  72., 66.),
       ('Mississippi', 4., 16.1, 259., 44.),
       ('Missouri', 3.,  9. , 178., 70.)

- Exercise: display all the areas (state names)

In [175]:
print('all states names .. \n', data['area'])
print()
print('total number of states .. \n', len(data['area']))

all states names .. 
 ['Alabama' 'Alaska' 'Arizona' 'Arkansas' 'California' 'Colorado'
 'Connecticut' 'Delaware' 'Florida' 'Georgia' 'Hawaii' 'Idaho' 'Illinois'
 'Indiana' 'Iowa' 'Kansas' 'Kentucky' 'Louisiana' 'Maine' 'Maryland'
 'Massachusetts' 'Michigan' 'Minnesota' 'Mississippi' 'Missouri' 'Montana'
 'Nebraska' 'Nevada' 'New Hampshire' 'New Jersey' 'New Mexico' 'New York'
 'North Carolina' 'North Dakota' 'Ohio' 'Oklahoma' 'Oregon' 'Pennsylvania'
 'Rhode Island' 'South Carolina' 'South Dakota' 'Tennessee' 'Texas' 'Utah'
 'Vermont' 'Virginia' 'Washington' 'West Virginia' 'Wisconsin' 'Wyoming']

total number of states .. 
 50


- Exercise: find the states where the `Murder` value is highest

In [176]:
np.max(data['Murder'])

4.0

In [178]:
index = data['Murder'] == np.max(data['Murder'])

In [209]:
data[index]

array([('Alabama', 4., 13.2, 236., 58.), ('Alaska', 4., 10. , 263., 48.),
       ('Arizona', 4.,  8.1, 294., 80.),
       ('California', 4.,  9. , 276., 91.),
       ('Delaware', 4.,  5.9, 238., 72.),
       ('Florida', 4., 15.4, 335., 80.),
       ('Illinois', 4., 10.4, 249., 83.),
       ('Louisiana', 4., 15.4, 249., 66.),
       ('Maryland', 4., 11.3, 300., 67.),
       ('Michigan', 4., 12.1, 255., 74.),
       ('Mississippi', 4., 16.1, 259., 44.),
       ('Nevada', 4., 12.2, 252., 81.),
       ('New Mexico', 4., 11.4, 285., 70.),
       ('New York', 4., 11.1, 254., 86.),
       ('North Carolina', 4., 13. , 337., 45.),
       ('South Carolina', 4., 14.4, 279., 48.)],
      dtype=[('area', '<U30'), ('Murder', '<f8'), ('Assault', '<f8'), ('UrbanPop', '<f8'), ('other_crime', '<f8')])
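
The boolean mask returns every record tied for the maximum, whereas `np.argmax` returns only the first. A small sketch on four records copied from the output above shows the difference; the `sample` array is built by hand for illustration.

```python
import numpy as np

dtypes = [('area', "U30"), ('Murder', float), ('Assault', float),
          ('UrbanPop', float), ('other_crime', float)]

# Four records copied from the outputs in this section.
sample = np.array([('Alabama', 4., 13.2, 236., 58.),
                   ('Arkansas', 3., 8.8, 190., 50.),
                   ('Hawaii', 1., 5.3, 46., 83.),
                   ('Florida', 4., 15.4, 335., 80.)], dtype=dtypes)

# Boolean mask: True for every record whose Murder value equals the maximum.
mask = sample['Murder'] == sample['Murder'].max()
tied_areas = sample['area'][mask]

# argmax gives the index of the first maximum only.
first_area = sample['area'][np.argmax(sample['Murder'])]

print(tied_areas)
print(first_area)
```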

- Exercise: display the averages of Murder, Assault, and UrbanPop

In [186]:
np.mean(data['Murder']), np.mean(data['Assault']), np.mean(data['UrbanPop'])

(2.72, 7.788, 170.76)

- Exercise: sort the data on Murder, then Assault
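
For a structured array, one possible approach is `np.sort` with the `order` argument, which sorts on the first field and breaks ties with the next. The sketch below uses four records copied from the outputs above, not the full file.

```python
import numpy as np

dtypes = [('area', "U30"), ('Murder', float), ('Assault', float),
          ('UrbanPop', float), ('other_crime', float)]

# Four records copied from the outputs in this section.
sample = np.array([('Alabama', 4., 13.2, 236., 58.),
                   ('Florida', 4., 15.4, 335., 80.),
                   ('Arkansas', 3., 8.8, 190., 50.),
                   ('Alaska', 4., 10., 263., 48.)], dtype=dtypes)

# Sort by Murder first; records with equal Murder are ordered by Assault.
by_murder_assault = np.sort(sample, order=['Murder', 'Assault'])

print(by_murder_assault['area'])
```

`np.sort` returns a sorted copy, so `sample` itself keeps its original order.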

In [201]:
b = np.array([[2020, 1, 11, 98],
              [2020, 2, 1,  99],
              [2021, 3, 16, 43],
              [2020, 2, 11, 54],
              [2021, 1, 13, 54],
              [2020, 1, 12, 74],
              [2021, 1, 3,  87],
              [2021, 3, 19, 23]])

In [202]:
b = b[b[:, 2].argsort()]                  # sort by day
b

array([[2020,    2,    1,   99],
       [2021,    1,    3,   87],
       [2020,    1,   11,   98],
       [2020,    2,   11,   54],
       [2020,    1,   12,   74],
       [2021,    1,   13,   54],
       [2021,    3,   16,   43],
       [2021,    3,   19,   23]])

In [203]:
b = b[b[:, 1].argsort(kind='stable')]     # sort by month; a stable sort keeps the day order within each month
b

array([[2021,    1,    3,   87],
       [2020,    1,   11,   98],
       [2020,    1,   12,   74],
       [2021,    1,   13,   54],
       [2020,    2,    1,   99],
       [2020,    2,   11,   54],
       [2021,    3,   16,   43],
       [2021,    3,   19,   23]])

In [212]:
b = b[b[:, 0].argsort(kind='stable')]  # sort by year; a stable sort keeps the month/day order within each year
b

array([[2020,    1,   11,   98],
       [2020,    1,   12,   74],
       [2020,    2,    1,   99],
       [2020,    2,   11,   54],
       [2021,    1,    3,   87],
       [2021,    1,   13,   54],
       [2021,    3,   16,   43],
       [2021,    3,   19,   23]])
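
The three successive sorts (day, then month, then year) can also be expressed as a single call to `np.lexsort`, which takes the keys from least to most significant. This is a sketch of that alternative on the same array.

```python
import numpy as np

b = np.array([[2020, 1, 11, 98],
              [2020, 2, 1,  99],
              [2021, 3, 16, 43],
              [2020, 2, 11, 54],
              [2021, 1, 13, 54],
              [2020, 1, 12, 74],
              [2021, 1, 3,  87],
              [2021, 3, 19, 23]])

# lexsort treats the LAST key as the primary one:
# here year is primary, then month, then day.
order = np.lexsort((b[:, 2], b[:, 1], b[:, 0]))
b_sorted = b[order]

print(b_sorted)
```

Since the keys are combined in one pass, there is no need to rely on the stability of each intermediate sort.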

#### Another example