<img src="../images/UBRA_Logo_DATA_TRAIN.png" style="width: 800px;">

## Working with data from ASCII files

Importing data from an ASCII file with a tool like `pandas` is convenient, but in many cases you have to parse the data line by line yourself.

In [72]:
f = open('../data/Bremen_tmin.txt')

`f` is now just a file object (a file handle):

In [73]:
f

<_io.TextIOWrapper name='../data/Bremen_tmin.txt' mode='r' encoding='UTF-8'>

Read all lines into a list:

In [100]:
lines = f.readlines()
f.close()
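Opening and closing the file by hand works, but it is easy to forget the `close()` call. A minimal sketch of the same read using a `with` block is shown here; the sample file and the names `sample_path` and `sample_lines` are made up for illustration so that the cell runs on its own.

```python
import os
import tempfile

# Hypothetical sample file standing in for ../data/Bremen_tmin.txt
sample_text = (
    "# station_name :: BREMEN-SEEFAHRTSCHULE\n"
    "# TMIN [Celsius] daily minimum temperature\n"
    " 1890  1  1     -5.50\n"
    " 1890  1  2     -7.40\n"
)
sample_path = os.path.join(tempfile.mkdtemp(), "sample_tmin.txt")
with open(sample_path, "w") as fout:
    fout.write(sample_text)

# The file is closed automatically when the with-block ends,
# even if an error is raised while reading.
with open(sample_path) as f:
    sample_lines = f.readlines()

print(f.closed)           # True
print(len(sample_lines))  # 4
```

The result is the same list of strings as with `open()` / `readlines()` / `close()`, only the closing step is handled for you.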

In [75]:
lines

['# Searching for GHCND series nr GM000001474\n',
 '# coordinates:  53.10N,    8.78E,      4.0m; GHCN-D station code: GM000001474 BREMEN-SEEFAHRTSCHULE Germany\n',
 '# WMO station 10224\n',
 '# institution :: NOAA/NCEI\n',
 '# source_url :: https://catalog.data.gov/dataset/global-historical-climatology-network-daily-ghcn-daily-version-3\n',
 '# source_doi :: https://doi.org/10.7289/V5D21VHZ\n',
 '# contact_email :: ncdc.ghcnd@noaa.gov\n',
 '# reference :: Matthew J. Menne, Imke Durre, Russell S. Vose, Byron E. Gleason, and Tamara G. Houston, 2012: An Overview of the Global Historical Climatology Network-Daily Database. J. Atmos. Oceanic Technol., 29, 897-910. doi:10.1175/JTECH-D-11-00103.1.\n',
 '# license :: U.S. Government Work. The non-U.S. data cannot be redistributed within or outside of the U.S. for any commercial activities.\n',
 '# station_code :: GM000001474\n',
 '# station_name :: BREMEN-SEEFAHRTSCHULE\n',
 '# station_country :: Germany\n',
 '# wmo_code :: 10224\n',
 '# latit

In [83]:
lines[20]

'# TMIN [Celsius] daily minimum temperature\n'

In [84]:
lines[21]

' 1890  1  1     -5.50\n'

In [86]:
one_line = lines[21]
one_line

' 1890  1  1     -5.50\n'

We can split this line into several parts (which will form another list):

In [88]:
one_line.split()

['1890', '1', '1', '-5.50']

In [89]:
one_line.split()[0]

'1890'

In [90]:
one_line.split()[-1]

'-5.50'
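Note that `split()` returns strings, not numbers. A small sketch of converting the pieces with the built-in `int()` and `float()` follows; the variable names `parts`, `year`, `month`, `day` and `tmin` are chosen here for illustration.

```python
one_line = ' 1890  1  1     -5.50\n'

parts = one_line.split()   # ['1890', '1', '1', '-5.50']
year = int(parts[0])       # date fields are whole numbers
month = int(parts[1])
day = int(parts[2])
tmin = float(parts[3])     # temperature has a decimal part

print(year, month, day, tmin)  # 1890 1 1 -5.5
print(tmin + 1)                # -4.5, arithmetic now works
```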

If we can split one line, we can split many in a loop. Here we use only the first 10 data lines:

In [92]:
for line in lines[21:31]:
    print(line)

 1890  1  1     -5.50

 1890  1  2     -7.40

 1890  1  3     -3.50

 1890  1  4     -1.90

 1890  1  5      0.20

 1890  1  6      6.00

 1890  1  7      5.80

 1890  1  8      1.80

 1890  1  9      0.10

 1890  1 10      3.00



In [93]:
for line in lines[21:31]:
    print(line.split())

['1890', '1', '1', '-5.50']
['1890', '1', '2', '-7.40']
['1890', '1', '3', '-3.50']
['1890', '1', '4', '-1.90']
['1890', '1', '5', '0.20']
['1890', '1', '6', '6.00']
['1890', '1', '7', '5.80']
['1890', '1', '8', '1.80']
['1890', '1', '9', '0.10']
['1890', '1', '10', '3.00']
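The slice `lines[21:31]` relies on knowing that the data start at index 21. A possible alternative, assuming every header line begins with `#` as in the listing above, is to skip those lines explicitly. The short list `sample_lines` below is a made-up stand-in for the real `lines`.

```python
# Hypothetical stand-in for the list returned by readlines()
sample_lines = [
    '# station_name :: BREMEN-SEEFAHRTSCHULE\n',
    '# TMIN [Celsius] daily minimum temperature\n',
    ' 1890  1  1     -5.50\n',
    ' 1890  1  2     -7.40\n',
    ' 1890  1  3     -3.50\n',
]

data_lines = []
for line in sample_lines:
    if line.startswith('#'):   # header/comment line, not data
        continue
    data_lines.append(line)

for line in data_lines:
    print(line.split())
```

This keeps working if the number of header lines differs from file to file.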


## Exercise

 - extract only the temperature values from the data (create an empty list and append to it)

We now have some values; let's write them to a file.

In [94]:
odata = [1, 2, 3, 4] # to be replaced by the result of the exercise

In [95]:
fout = open('out.txt', 'w') # 'w' means file will be opened for writing

In [97]:
for record in odata:
    fout.write(str(record)+'\n')
fout.close()

In [98]:
!head out.txt

1
2
3
4
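When writing floating-point values, `str()` gives little control over the layout. A minimal sketch using an f-string format specification together with a `with` block is given below; the values and the path `out_path` are invented for illustration.

```python
import os
import tempfile

values = [-5.5, -7.4, 0.2]   # hypothetical temperatures
out_path = os.path.join(tempfile.mkdtemp(), 'out_formatted.txt')

with open(out_path, 'w') as fout:
    for value in values:
        # 6.2f: field width of 6 characters, 2 digits after the decimal point
        fout.write(f'{value:6.2f}\n')

# Read the file back to check what was written
with open(out_path) as fin:
    written = fin.read()
print(written)
```

Fixed-width fields like this keep the columns aligned, similar to the layout of the input file.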


In [None]:
# %load out.txt
1
2
3
4


## Exercise

- extract years, months, days and temperature into four separate variables
- create an output file with records of the form:
     
    YYYY:MM:DD temperature
    

- turn this into a function that takes the names of the input and output files as arguments.
- try running this function on the `../data/Bremen_tmin.txt` file

What about the other information in this file? How do we extract values from less structured lines?

In [101]:
f = open('../data/Bremen_tmin.txt')
lines = f.readlines()
f.close()

In [104]:
lines[1]

'# coordinates:  53.10N,    8.78E,      4.0m; GHCN-D station code: GM000001474 BREMEN-SEEFAHRTSCHULE Germany\n'

In [105]:
lines[1].split()

['#',
 'coordinates:',
 '53.10N,',
 '8.78E,',
 '4.0m;',
 'GHCN-D',
 'station',
 'code:',
 'GM000001474',
 'BREMEN-SEEFAHRTSCHULE',
 'Germany']

In [108]:
lines[1].split()[2]

'53.10N,'

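In principle, this is where [regular expressions](https://docs.python.org/3/howto/regex.html) can become useful, but we will avoid them here, since we know that the last two characters of this field will always be `N,` or `S,`. The same slice works for the longitude and altitude fields, which end in `E,` and `m;` in this file.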

In [109]:
lines[1].split()[2][:-2]

'53.10'

In [110]:
lines[1].split()[3][:-2]

'8.78'

In [111]:
lines[1].split()[4][:-2]

'4.0'
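For comparison, here is a rough sketch of how the same line could be parsed with the `re` module. The pattern and the sign convention (negative for `S` and `W`) are assumptions made for this illustration, not something prescribed by the file format.

```python
import re

line = '# coordinates:  53.10N,    8.78E,      4.0m; GHCN-D station code: GM000001474 BREMEN-SEEFAHRTSCHULE Germany\n'

# number + hemisphere letter, twice, then altitude in metres
pattern = re.compile(
    r'coordinates:\s*([\d.]+)([NS]),\s*([\d.]+)([EW]),\s*(-?[\d.]+)m'
)

match = pattern.search(line)
if match:
    lat_deg = float(match.group(1))
    if match.group(2) == 'S':      # assumed convention: south is negative
        lat_deg = -lat_deg
    lon_deg = float(match.group(3))
    if match.group(4) == 'W':      # assumed convention: west is negative
        lon_deg = -lon_deg
    alt_m = float(match.group(5))
    print(lat_deg, lon_deg, alt_m)  # 53.1 8.78 4.0
```

The regular expression is harder to read than the slicing, but it also checks that the line really has the expected shape.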

OK, now we know how to parse the line, but how do we identify it? If the line contains a unique word or sequence of characters, we can always find it:

In [112]:
our_line = lines[1]
our_line

'# coordinates:  53.10N,    8.78E,      4.0m; GHCN-D station code: GM000001474 BREMEN-SEEFAHRTSCHULE Germany\n'

In [113]:
our_line.startswith('# coordinates:')

True

In [114]:
'# coordinates:' in our_line

True

In [115]:
'# coordinates!' in our_line

False

In [116]:
f = open('../data/Bremen_tmin.txt')
lines = f.readlines()
f.close()
for line in lines:
    if line.startswith('# coordinates:'):
        # use the matching line itself, not a hard-coded index
        lat = line.split()[2][:-2]
        lon = line.split()[3][:-2]
        alt = line.split()[4][:-2]
        
print('Coordinates of the station')
print(f'lon:{lon}    lat:{lat}    alt:{alt}')

Coordinates of the station
lon:8.78    lat:53.10    alt:4.0
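The loop above can be wrapped into a small function. The sketch below is one possible way to do it: it iterates over the file directly instead of calling `readlines()`, stops at the first match, and returns floats. The function name `read_coordinates` and the sample file are hypothetical, and the slicing assumes the suffixes are two characters long as in this file.

```python
import os
import tempfile

def read_coordinates(path):
    """Return (lat, lon, alt) as floats from the first '# coordinates:' line, or None."""
    with open(path) as f:
        for line in f:   # read line by line, without loading the whole file
            if line.startswith('# coordinates:'):
                parts = line.split()
                # drop the trailing 'N,', 'E,' and 'm;' before converting
                return (float(parts[2][:-2]),
                        float(parts[3][:-2]),
                        float(parts[4][:-2]))
    return None          # no coordinates line found

# Hypothetical sample file so that the cell runs on its own
sample_path = os.path.join(tempfile.mkdtemp(), 'sample_header.txt')
with open(sample_path, 'w') as fout:
    fout.write('# WMO station 10224\n')
    fout.write('# coordinates:  53.10N,    8.78E,      4.0m; GHCN-D station code: GM000001474\n')
    fout.write(' 1890  1  1     -5.50\n')

coords = read_coordinates(sample_path)
print(coords)  # (53.1, 8.78, 4.0)
```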


## Exercise

- Write a function that reads information about the file (the path to the file is passed as an argument) and reports it in the form:


    Station name: XXXX
    WMO number:
    Coordinates:  lon(XX) lat(XX) alt(XX)
    History: downloaded at: YYYY:MM:DD

