# Example (synthetic) data sets
The purpose of this effort to create a schema for weather data related to one location, Lamesa, TX, mentioned in the Legacy Cotton Dataset. 

In this chapter we will inspect the provided dataset and prepare it to be converted into linked data. As described earlier, linked data is data that follows a linked-data format consisting of triples where the individual parts consists of either values or identifiers (IRIs) pointing to concept definitions. 

The proces of converting data to linked data is rather straightforward and can be summed up as follows:
1. Inspect the data
2. Clean the data headers
3. Align the data with common controlled vocabularies or ontologies
4. Design linked-data shapes using the IRIs from the controlled vocabularies and ontologies
5. Transform the data into RDF

In this chapter we will focus on the first two steps.

## Dataset 2 on Weather
The first dataset is a  csv file (Dawson3242890.csv).
It contains weather data for Lamesa, TX for March1- October 30, 2010.

In [1]:
import pandas as pd
weather = pd.read_csv("data/Dawson3242890.csv",index_col=False)
weather

Unnamed: 0,STATION,NAME,DATE,DAPR,MDPR,PRCP,SNOW,SNWD,TMAX,TMIN,TOBS,WT01,WT11
0,USC00415013,"LAMESA 1 SSE, TX US",2010-03-01,,,0.44,0.0,0.0,60,36.0,36.0,,
1,USC00415013,"LAMESA 1 SSE, TX US",2010-03-02,,,0.02,0.0,0.0,41,27.0,27.0,,
2,USC00415013,"LAMESA 1 SSE, TX US",2010-03-03,,,0.00,0.0,0.0,59,27.0,33.0,,
3,USC00415013,"LAMESA 1 SSE, TX US",2010-03-04,,,0.00,0.0,0.0,68,33.0,35.0,,
4,USC00415013,"LAMESA 1 SSE, TX US",2010-03-05,,,0.00,0.0,0.0,71,33.0,43.0,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
240,USC00415013,"LAMESA 1 SSE, TX US",2010-10-27,,,0.00,0.0,0.0,74,40.0,47.0,,
241,USC00415013,"LAMESA 1 SSE, TX US",2010-10-28,,,0.00,0.0,0.0,76,40.0,40.0,,
242,USC00415013,"LAMESA 1 SSE, TX US",2010-10-29,,,0.00,0.0,0.0,67,32.0,32.0,,
243,USC00415013,"LAMESA 1 SSE, TX US",2010-10-30,,,0.00,0.0,0.0,73,32.0,45.0,,


In [4]:
weather.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 43474 entries, 0 to 43473
Data columns (total 40 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   OBJECTID            43474 non-null  int64  
 1   LocationName        43474 non-null  object 
 2   LatLong             43474 non-null  object 
 3   Latitude            43474 non-null  float64
 4   Longitude           43474 non-null  float64
 5   City                29623 non-null  object 
 6   County              43474 non-null  object 
 7   State               43474 non-null  object 
 8   Name                43474 non-null  object 
 9   TestType            43474 non-null  object 
 10  EntryNumber         43474 non-null  int64  
 11  Brand               43474 non-null  object 
 12  Trait               43473 non-null  object 
 13  Product             43474 non-null  object 
 14  Soil                32971 non-null  object 
 15  Tillage             32971 non-null  object 
 16  Plan

This dataset contains 26 fields of various datatypes. In one of the next steps this field names will be alligned with various controlled vocabularies and ontologies. For this to be succesful the fieldnames need to be as expressive as it can be. 

In [5]:
def show_difference(row):
    highlight = 'background-color: yellow;'
    default = ''
    if row['original_field'] != row['prepared_field']:
        return [default, highlight]
    else:
        return [default, default]
    
def strip_check(row):
    if row['original_field'] != row['prepared_field']:
        row["change"] = "remove trailing spaces"
        
def abbreviations(row):
    if row['original_field'] != row['prepared_field']:
        if row["change"] == "":
            row["change"] = "resolved abbreviations"
            
def removechoices(row):
    if row['original_field'] != row['prepared_field']:
        if row["change"] == "":
            row["change"] = "removed choices"

df = pd.DataFrame(columns=["original_field", "prepared_field", "change"])
for column in gdb1.columns: 
    df.loc[len(df.index)] = [column, column.strip(), ""]
    
df.apply(strip_check, axis=1)
    
df['prepared_field'] = df['prepared_field'].str.replace('adj','adjuvant')
df['prepared_field'] = df['prepared_field'].str.replace('months OS','months overall survival')
df['prepared_field'] = df['prepared_field'].str.replace('PP','pseudo-progression')
df['prepared_field'] = df['prepared_field'].str.replace('PFS','progression-free survival')

df.apply(abbreviations, axis=1)
         
df['prepared_field'] = df['prepared_field'].str.replace('(1=yes 0=no)','', regex=False)
df.apply(removechoices, axis=1)
 
df.style.set_properties(**{'text-align': 'left'})
df.style.apply(show_difference, subset=['original_field', 'prepared_field'], axis=1)

Unnamed: 0,original_field,prepared_field,change
0,OBJECTID,OBJECTID,
1,LocationName,LocationName,
2,LatLong,LatLong,
3,Latitude,Latitude,
4,Longitude,Longitude,
5,City,City,
6,County,County,
7,State,State,
8,Name,Name,
9,TestType,TestType,


In this chapter we have reviewed and possibly cleaned the source data. In the next chapter these terms will be use to identify IRIs that unambiguously point to the definitions of these field labels. In this stage of the project we need to be a bit creative here. Some crucial information such as conditionals or units have been removed, but these are needed in the semantic models that will be derived.

Moving forward the project should design a common tabular format that, next to field labels, also captures these conditionals, units and cardinality. 

Eventually this book should contain a chapter that describes this tabular format. 

Eventually the steps described in this chapter will be redundant. Moving forward performers in the different PDAs, ideally will build on a predefined tabular format where the field names are selected from the provided codebook. 

## NALT shapes and  codebook
The DT codebook will be a listing of selected field names. Data curators will be able to select field names from this codebook. Non-existing field names can be requested. 
