The purpose of this notebook is to use machine learning to see if I can come up with a predictor for the change in [SymH](http://wdc.kugi.kyoto-u.ac.jp/aeasy/). I expect the best predictor to be IMF Bz, since we know that southward turnings of the IMF can drive geomagnetic storms.

First, this code will read the solar wind (SW) data files into a single pandas data frame that can be used for the ML programs. The SW data files from CDAWeb have additional information and metadata to make sure the files are useful for future generations. I'll just pull out the stuff that is relevant for my playing here, but anyone interested can read all they want at the [CDAWeb site](https://cdaweb.gsfc.nasa.gov/index.html/). I am going to grab the data from OMNI (shifted solar wind data) with the following fields:

- time (this is automatically included)
- Bx
- By
- Bz
- Flow speed
- vx
- vy
- vz
- proton density
- flow pressure
- SymH

I am also going to use 1-minute time resolution. Note that there will be data gaps in some of the fields, so I'll have to handle that as well. I have been downloading these files by hand. I had code that would go to the website and download the data files, but it recently started giving me errors. I believe there may have been a formatting change to the website, so I will have to look at that later.

In [69]:
import pandas as pd
from os import listdir
from os.path import isfile, join

Now read in the contents of the data directory. I have a number of *.txt files, so I want to go through the list of files, read in each one, and concatenate them together.

In [70]:
mypath = "../SW_data/"
files = [f for f in listdir(mypath) if isfile(join(mypath, f))]
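One thing to watch for: `listdir` returns every file in the directory, so a stray README or other non-data file would also end up in `files`. Here is a minimal sketch of restricting the list to `*.txt` files with `glob` from the standard library. The helper name `list_txt_files` is made up for illustration, and the sort is only there to make the read order reproducible.

```python
from glob import glob
from os.path import isfile, join


def list_txt_files(directory):
    """Return the sorted paths of the *.txt files directly inside `directory`."""
    # glob returns paths that already include the directory prefix
    return sorted(p for p in glob(join(directory, "*.txt")) if isfile(p))
```

Note that these are full paths, so they could be passed to `pd.read_csv` directly without prepending `mypath`.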

In [71]:
# Column names for the OMNI fields listed above
cols = ['date', 'time', 'bx', 'by', 'bz', 'flow_speed',
        'vx', 'vy', 'vz', 'H_den', 'Pdyn', 'SymH']

# Read every file, skipping the first 73 and last 3 lines of each,
# then concatenate them into one data frame
frames = [pd.read_csv(join(mypath, f), sep=r'\s+', engine='python', skiprows=73,
                      skipfooter=3, header=None, names=cols)
          for f in files]
data = pd.concat(frames)
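With `date` and `time` read in as separate columns, it can be handy to merge them into a single timestamp. This is a minimal sketch, assuming the dates look like `dd-mm-yyyy` and the times like `hh:mm:ss.mmm`. That format is an assumption on my part and should be checked against the actual files. The small frame below is made-up sample data, not real OMNI values.

```python
import pandas as pd

# Made-up rows standing in for the concatenated data frame
sample = pd.DataFrame({
    "date": ["01-01-2015", "01-01-2015"],
    "time": ["00:00:00.000", "00:01:00.000"],
    "SymH": [-12, -13],
})

# Assumed format: day-month-year, then hours:minutes:seconds.milliseconds
stamps = pd.to_datetime(sample["date"] + " " + sample["time"],
                        format="%d-%m-%Y %H:%M:%S.%f")

# Use the timestamps as a sorted index and drop the string columns
sample = (sample.assign(datetime=stamps)
                .drop(columns=["date", "time"])
                .set_index("datetime")
                .sort_index())
```

Sorting by the timestamp index also takes care of files that were read in a non-chronological order.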
    

Now that we have the data read in, we need to handle the bad data: times when, for whatever reason, there is no measurement. Luckily these are already flagged for us, sort of. The following values indicate bad data:

- Magnetic field, 9999.99
- Flow speed, 99999.9
- velocity vector, 99999.9
- Proton density, 999.99
- Dynamic pressure, 99.99
- SymH, 99999
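
One way to deal with these flags is to turn them into NaN so that pandas treats them as missing. Here is a minimal sketch, assuming the column names from the `read_csv` call earlier. The helper `mask_fill_values` and the demo rows are made up for illustration.

```python
import numpy as np
import pandas as pd

# Fill values from the list above, keyed by the column names used in read_csv
fill_values = {
    "bx": 9999.99, "by": 9999.99, "bz": 9999.99,
    "flow_speed": 99999.9,
    "vx": 99999.9, "vy": 99999.9, "vz": 99999.9,
    "H_den": 999.99,
    "Pdyn": 99.99,
    "SymH": 99999,
}


def mask_fill_values(df, fills=fill_values):
    """Return a copy of df with each column's fill value replaced by NaN."""
    out = df.copy()
    for col, fill in fills.items():
        if col in out.columns:
            # A tight absolute tolerance avoids relying on exact float equality
            is_fill = np.isclose(out[col], fill, rtol=0, atol=1e-6)
            out[col] = out[col].where(~is_fill)
    return out


# Made-up example rows, not real OMNI values
demo = pd.DataFrame({"bz": [-3.2, 9999.99], "SymH": [-15, 99999]})
clean = mask_fill_values(demo)
```

Each column is checked only against its own flag, so a legitimate value in one field is never masked just because it matches the flag of another field.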