### Importing Libraries

In [57]:
import pandas as pd
import requests

### Pull Data from OpenWeatherMap.org Using their API

The data comes from OpenWeatherMap.org. The API address in the code below includes an API key, which you can obtain from their website. To look up the weather for a different place, change the value assigned to 'q'.

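The cell below tries this for Alabama. Note that 'q' is matched against place names, so the coordinates in the response (latitude 43.1, longitude -78.39) belong to a locality named Alabama, not to the state as a whole.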

In [58]:
js_data = requests.get('http://api.openweathermap.org/data/2.5/weather?appid=0c42f7f6b53b244c78a418f4f181282a&q=alabama').json()
print(js_data)

{'coord': {'lon': -78.39, 'lat': 43.1}, 'weather': [{'id': 521, 'main': 'Rain', 'description': 'shower rain', 'icon': '09d'}, {'id': 701, 'main': 'Mist', 'description': 'mist', 'icon': '50d'}], 'base': 'stations', 'main': {'temp': 293.56, 'pressure': 1009, 'humidity': 93, 'temp_min': 291.48, 'temp_max': 295.93}, 'visibility': 11265, 'wind': {'speed': 1.5, 'deg': 210}, 'rain': {'1h': 1.02}, 'clouds': {'all': 90}, 'dt': 1565185252, 'sys': {'type': 1, 'id': 4291, 'message': 0.009, 'country': 'US', 'sunrise': 1565172618, 'sunset': 1565224106}, 'timezone': -14400, 'id': 5118380, 'name': 'Alabama', 'cod': 200}
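
As an aside, the request URL can also be assembled from its parts instead of being typed out by hand. The sketch below uses the standard library's `urlencode`, so names containing spaces are escaped. `build_weather_url`, `BASE_URL` and the `OWM_API_KEY` environment variable are names made up for this illustration. The optional country code assumes the API accepts 'q' values of the form city,country.

```python
import os
from urllib.parse import urlencode

BASE_URL = "http://api.openweathermap.org/data/2.5/weather"


def build_weather_url(place, api_key, country_code=None):
    """Return the request URL for one place, with the query string URL-encoded."""
    # Appending a country code (e.g. "London,GB") narrows the match.
    # This is an assumption about the API, not something shown in this notebook.
    query = f"{place},{country_code}" if country_code else place
    return BASE_URL + "?" + urlencode({"appid": api_key, "q": query})


# OWM_API_KEY is a hypothetical environment variable name, not part of the API.
api_key = os.environ.get("OWM_API_KEY", "YOUR_API_KEY")
print(build_weather_url("New York", api_key))
```

Reading the key from the environment keeps it out of the notebook itself. The resulting URL can be passed to `requests.get` exactly as in the cell above.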


The returned JSON contains several nested dictionaries and a list. We can take these apart and combine the pieces into a pandas DataFrame, giving a neat dataset with one row per place. That is what we do below.

First, we create a list of the places whose weather information we want to collect, and then pull out all the details we want in our dataset.

In [59]:
us_states = ['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California', 'Colorado', 'Connecticut',
             'Delaware', 'Florida', 'Georgia', 'Hawaii', 'Idaho', 'Illinois', 'Indiana', 'Iowa', 'Kansas',
             'Kentucky', 'Louisiana', 'Maine', 'Maryland', 'Massachusetts', 'Michigan', 'Minnesota', 'Mississippi',
             'Montana', 'Nebraska', 'Nevada', 'New Hampshire', 'New Jersey', 'New Mexico', 'New York',
             'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania', 'Rhode Island',
             'South Carolina', 'South Dakota', 'West Virginia', 'Tennessee', 'Texas', 'Utah', 'Vermont',
             'Virginia', 'Washington', 'Wisconsin', 'Wyoming', 'London', 'Paris']




In [64]:
city_long = []
city_lat = []
city_weather = []
city_base = []
city_pressure = []
city_humidity = []
city_temp = []
city_temp_min = []
city_temp_max = []
city_visibility = []
city_windspeed = []
city_wind_deg = []
city_country = []
city_sunrise = []
city_sunset = []
city_timezone = []


for state in us_states:
    api_address='http://api.openweathermap.org/data/2.5/weather?appid=0c42f7f6b53b244c78a418f4f181282a&q='
    url = api_address + state
    json_data = requests.get(url).json()
    
    try:
        cty_lon = json_data['coord']['lon']
        city_long.append(cty_lon)
    except KeyError:
        # No longitude in the response: record a placeholder
        city_long.append('NA')
    
    try:
        cty_lat = json_data['coord']['lat']
        city_lat.append(cty_lat)
    except KeyError:
        # No latitude in the response: record a placeholder
        city_lat.append('NA')
    
    cty_we = json_data['weather'][0]
    cty_wea = cty_we['description']
    city_weather.append(cty_wea)
    print(state + '  --  ' + cty_wea)
    
    cty_base = json_data['base']
    city_base.append(cty_base)
    
    cty_press = json_data['main']['pressure']
    city_pressure.append(cty_press)
    
    cty_humid = json_data['main']['humidity']
    city_humidity.append(cty_humid)
    
    cty_temp = json_data['main']['temp']
    city_temp.append(cty_temp)
    
    cty_temp_min = json_data['main']['temp_min']
    city_temp_min.append(cty_temp_min)
    
    cty_temp_max = json_data['main']['temp_max']
    city_temp_max.append(cty_temp_max)
    
    try:
        cty_visi = json_data['visibility']
        city_visibility.append(cty_visi)
    except KeyError:
        # Visibility is not reported for every place
        city_visibility.append('NA')
    
    cty_speed = json_data['wind']['speed']
    city_windspeed.append(cty_speed)
    
    try:
        cty_deg = json_data['wind']['deg']
        city_wind_deg.append(cty_deg)
    except KeyError:
        # Wind direction is not reported for every place
        city_wind_deg.append('NA')
    
    cty_country = json_data['sys']['country']
    city_country.append(cty_country)
    
    cty_srise = json_data['sys']['sunrise']
    city_sunrise.append(cty_srise)
    
    cty_sset = json_data['sys']['sunset']
    city_sunset.append(cty_sset)
    
    cty_tz = json_data['timezone']
    city_timezone.append(cty_tz)


    
            

Alabama  --  shower rain
Alaska  --  clear sky
Arizona  --  clear sky
Arkansas  --  overcast clouds
California  --  broken clouds
Colorado  --  clear sky
Connecticut  --  mist
Delaware  --  mist
Florida  --  broken clouds
Georgia  --  few clouds
Hawaii  --  mist
Idaho  --  clear sky
Illinois  --  clear sky
Indiana  --  broken clouds
Iowa  --  clear sky
Kansas  --  clear sky
Kentucky  --  clear sky
Louisiana  --  scattered clouds
Maine  --  broken clouds
Maryland  --  broken clouds
Massachusetts  --  broken clouds
Michigan  --  overcast clouds
Minnesota  --  clear sky
Mississippi  --  fog
Montana  --  clear sky
Nebraska  --  mist
Nevada  --  scattered clouds
New Hampshire  --  mist
New Jersey  --  broken clouds
New Mexico  --  haze
New York  --  fog
North Carolina  --  scattered clouds
North Dakota  --  overcast clouds
Ohio  --  clear sky
Oklahoma  --  mist
Oregon  --  clear sky
Pennsylvania  --  scattered clouds
Rhode Island  --  scattered clouds
South Carolina  --  clear sky
South Dak
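
The repeated try/except pattern in the loop above could be condensed into one lookup helper. The following is a minimal sketch, not the code that produced the output above. `get_nested`, `FIELDS` and `extract_record` are hypothetical names, the field map covers only a subset of the columns, and `sample` is a hand-trimmed response with the wind direction left out on purpose.

```python
def get_nested(data, path, default="NA"):
    """Follow a sequence of keys/indexes into nested JSON; return default if a step is missing."""
    current = data
    for key in path:
        try:
            current = current[key]
        except (KeyError, IndexError, TypeError):
            return default
    return current


# Column name -> path into the response (a subset of the fields collected above).
FIELDS = {
    "Latitude": ("coord", "lat"),
    "Longitude": ("coord", "lon"),
    "Weather": ("weather", 0, "description"),
    "Visibility": ("visibility",),
    "Wind Degree": ("wind", "deg"),
    "Country": ("sys", "country"),
}


def extract_record(name, json_data):
    """Build one flat row (a dict) for a place from its JSON response."""
    record = {"Name of City": name}
    for column, path in FIELDS.items():
        record[column] = get_nested(json_data, path)
    return record


# Illustrative, hand-trimmed response: 'visibility' and 'wind.deg' are absent.
sample = {
    "coord": {"lon": -78.39, "lat": 43.1},
    "weather": [{"description": "shower rain"}],
    "wind": {"speed": 1.5},
    "sys": {"country": "US"},
}
print(extract_record("Alabama", sample))
```

A list of such records can be passed straight to `pd.DataFrame`, which avoids maintaining one Python list per column. Because every field goes through the same fallback, an error response with no weather data yields a row of 'NA' values instead of stopping the loop.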

In [70]:
us_weather_data = pd.DataFrame({'Name of City': us_states,
                                'Latitude': city_lat,
                                'Longitude': city_long,
                                'Weather': city_weather,
                                'Base': city_base,
                                'Pressure': city_pressure,
                                'Humidity': city_humidity,
                                'Temp': city_temp,
                                'Min Temp': city_temp_min,
                                'Max Temp': city_temp_max,
                                'Visibility': city_visibility,
                                'WindSpeed': city_windspeed,
                                'Wind Degree': city_wind_deg,
                                'Country': city_country,
                                'Sunrise': city_sunrise,
                                'Sunset': city_sunset,
                                'Timezone': city_timezone})
print(us_weather_data.info())
us_weather_data.head(50)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51 entries, 0 to 50
Data columns (total 17 columns):
Name of City    51 non-null object
Latitude        51 non-null float64
Longitude       51 non-null float64
Weather         51 non-null object
Base            51 non-null object
Pressure        51 non-null float64
Humidity        51 non-null int64
Temp            51 non-null float64
Min Temp        51 non-null float64
Max Temp        51 non-null float64
Visibility      51 non-null object
WindSpeed       51 non-null float64
Wind Degree     51 non-null object
Country         51 non-null object
Sunrise         51 non-null int64
Sunset          51 non-null int64
Timezone        51 non-null int64
dtypes: float64(7), int64(4), object(6)
memory usage: 6.9+ KB
None


Unnamed: 0,Name of City,Latitude,Longitude,Weather,Base,Pressure,Humidity,Temp,Min Temp,Max Temp,Visibility,WindSpeed,Wind Degree,Country,Sunrise,Sunset,Timezone
0,Alabama,43.1,-78.39,shower rain,stations,1009.0,93,293.56,291.48,295.93,11265.0,1.5,210.0,US,1565172618,1565224106,-14400
1,Alaska,-17.38,30.08,clear sky,stations,1013.54,15,302.211,302.211,302.211,,2.72,85.629,ZW,1565151791,1565192871,7200
2,Arizona,-28.39,-49.38,clear sky,stations,1022.96,42,299.411,299.411,299.411,,0.34,293.706,BR,1565171767,1565211033,-10800
3,Arkansas,37.58,-82.73,overcast clouds,stations,1014.0,78,295.0,293.71,296.15,16093.0,3.48,260.006,US,1565174385,1565224423,-14400
4,California,25.76,-103.38,broken clouds,stations,1018.0,50,299.15,299.15,299.15,16093.0,3.6,100.0,MX,1565180564,1565228155,-18000
5,Colorado,-22.84,-51.97,clear sky,stations,1023.2,49,299.411,299.411,299.411,,5.87,37.184,BR,1565171910,1565212133,-10800
6,Connecticut,41.67,-72.67,mist,stations,1011.0,88,296.85,294.82,298.71,16093.0,1.5,210.0,US,1565171445,1565222534,-14400
7,Delaware,40.3,-83.07,mist,stations,1011.0,88,294.62,292.59,296.48,16093.0,2.66,293.689,US,1565174126,1565224844,-14400
8,Florida,-34.1,-56.21,broken clouds,stations,1020.61,89,287.611,287.611,287.611,,4.56,155.834,UY,1565173952,1565212126,-10800
9,Georgia,17.96,-76.51,few clouds,stations,1014.0,62,304.15,304.15,304.15,10000.0,7.7,130.0,JM,1565174774,1565221048,-18000


In [67]:
us_weather_data.shape
us_weather_data.columns

Index(['Name of City', 'Latitude', 'Longitude', 'Weather', 'Base', 'Pressure',
       'Humidity', 'Temp', 'Min Temp', 'Max Temp', 'Visibility', 'WindSpeed',
       'Wind Degree', 'Country', 'Sunrise', 'Sunset', 'Timezone'],
      dtype='object')

In [None]:
# The data is already in a readable format, so we move on to the next section.

### Data Transformation

The column "Base" holds the same value for every city, so it carries no information, and the column "Visibility" has many NA values. Both columns are therefore dropped.

In [73]:
us_weather_data1 = us_weather_data.drop(['Base', 'Visibility'], axis=1)
us_weather_data1.head(3)

Unnamed: 0,Name of City,Latitude,Longitude,Weather,Pressure,Humidity,Temp,Min Temp,Max Temp,WindSpeed,Wind Degree,Country,Sunrise,Sunset,Timezone
0,Alabama,43.1,-78.39,shower rain,1009.0,93,293.56,291.48,295.93,1.5,210.0,US,1565172618,1565224106,-14400
1,Alaska,-17.38,30.08,clear sky,1013.54,15,302.211,302.211,302.211,2.72,85.629,ZW,1565151791,1565192871,7200
2,Arizona,-28.39,-49.38,clear sky,1022.96,42,299.411,299.411,299.411,0.34,293.706,BR,1565171767,1565211033,-10800


We fill the NA values in Wind Degree with the column average for three reasons: the standard deviations are not far apart; fewer than 2% of the values are missing, so an imputed value is unlikely to change the analysis significantly; and some analyses require every cell to be filled.

In [79]:
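# Collect the Wind Degree entries that can be read as numbers
wind_deg = []
for deg in us_weather_data1['Wind Degree']:
    try:
        wind_deg.append(float(deg))
    except (TypeError, ValueError):
        # Skip the 'NA' placeholders
        pass


def Average(lst):
    """Return the arithmetic mean of a non-empty list of numbers."""
    return sum(lst) / len(lst)


print(Average(wind_deg))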

227.63596


### Replacing the NA values with the Average Wind Degree

In [85]:
# Restrict the replacement to 'Wind Degree' so no other column is touched
us_weather_data2 = us_weather_data1.replace({'Wind Degree': {'NA': Average(wind_deg)}})
us_weather_data2.head()

Unnamed: 0,Name of City,Latitude,Longitude,Weather,Pressure,Humidity,Temp,Min Temp,Max Temp,WindSpeed,Wind Degree,Country,Sunrise,Sunset,Timezone
0,Alabama,43.1,-78.39,shower rain,1009.0,93,293.56,291.48,295.93,1.5,210.0,US,1565172618,1565224106,-14400
1,Alaska,-17.38,30.08,clear sky,1013.54,15,302.211,302.211,302.211,2.72,85.629,ZW,1565151791,1565192871,7200
2,Arizona,-28.39,-49.38,clear sky,1022.96,42,299.411,299.411,299.411,0.34,293.706,BR,1565171767,1565211033,-10800
3,Arkansas,37.58,-82.73,overcast clouds,1014.0,78,295.0,293.71,296.15,3.48,260.006,US,1565174385,1565224423,-14400
4,California,25.76,-103.38,broken clouds,1018.0,50,299.15,299.15,299.15,3.6,100.0,MX,1565180564,1565228155,-18000
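
The same mean imputation can be written with pandas alone. This is a sketch of an alternative, not the method used above. It runs on a small made-up Series, not the real column, and the variable names are illustrative.

```python
import pandas as pd

# Toy column standing in for 'Wind Degree'; the values are illustrative only.
wind = pd.Series([210.0, 85.629, "NA", 260.006], name="Wind Degree")

# Coerce to float; anything non-numeric (here the string 'NA') becomes NaN.
wind_numeric = pd.to_numeric(wind, errors="coerce")

# Series.mean() skips NaN by default, so this is the mean of the observed values.
wind_filled = wind_numeric.fillna(wind_numeric.mean())

print(wind_filled.dtype)
print(wind_filled.tolist())
```

One benefit of this route is that the column ends up with a numeric dtype instead of `object`. The missing entry is also a real NaN before it is filled, so `isna()` can detect it.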


In [86]:
us_weather_data2.isna().sum()

Name of City    0
Latitude        0
Longitude       0
Weather         0
Pressure        0
Humidity        0
Temp            0
Min Temp        0
Max Temp        0
WindSpeed       0
Wind Degree     0
Country         0
Sunrise         0
Sunset          0
Timezone        0
dtype: int64

After the replacement, the dataset has no missing values. Had any remained, we would have dropped the affected rows, depending on how many there were.

In [89]:
us_weather_data2.shape

(51, 15)

### Handling Duplicates

In [87]:
us_weather_nodupes = us_weather_data2.drop_duplicates()
us_weather_nodupes.shape

(51, 15)

The shape is unchanged at (51, 15), so there were no duplicates in the dataset. It is now cleanly formatted and ready for analysis!