## To support large messages with Kafka:

You need to adjust three (or four) properties:

Consumer side: fetch.message.max.bytes - the largest message size the consumer can fetch.<br>
Broker side: replica.fetch.max.bytes - lets the replicas in the cluster fetch messages from each other so that messages are replicated correctly. If this is too small, the message is never replicated, so the consumer never sees it, because the message is never committed (fully replicated).<br>
Broker side: message.max.bytes - the largest message size the broker accepts from a producer.<br>
Broker side (per topic): max.message.bytes - the largest message size the broker allows to be appended to the topic. This size is validated pre-compression. (Defaults to the broker's message.max.bytes.)
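
The relationship between these limits can be sketched as a small sanity check. This is only an illustration: the property names are the ones listed above, but the helper `check_kafka_size_limits` and the example values are assumptions, not part of the project.

```python
# Hypothetical helper (not part of the project): reports which of the
# size limits described above are too small for a given message size.

def check_kafka_size_limits(message_bytes, broker, consumer, topic=None):
    """Return the names of all limits that are smaller than message_bytes."""
    topic = topic or {}
    # The per-topic limit falls back to the broker's message.max.bytes.
    topic_limit = topic.get("max.message.bytes", broker["message.max.bytes"])
    limits = {
        "message.max.bytes (broker)": broker["message.max.bytes"],
        "replica.fetch.max.bytes (broker)": broker["replica.fetch.max.bytes"],
        "max.message.bytes (topic)": topic_limit,
        "fetch.message.max.bytes (consumer)": consumer["fetch.message.max.bytes"],
    }
    return [name for name, limit in limits.items() if limit < message_bytes]


# Example values (assumed): every limit raised to 5 MB for a 4 MB message.
broker = {"message.max.bytes": 5_000_000, "replica.fetch.max.bytes": 5_000_000}
consumer = {"fetch.message.max.bytes": 5_000_000}
too_small = check_kafka_size_limits(4_000_000, broker, consumer)
```

An empty result means a message of that size should pass every stage; any name in the list points at the property to raise.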

## States structure:

OpenSkyStates: time (long), states (list of StateVector)

StateVector:
<table align="left">
 <tr><td>icao24         </td><td>str</td></tr>
 <tr><td>callsign       </td><td>str</td></tr>
 <tr><td>origin_country </td><td>str</td></tr>
 <tr><td>time_position  </td><td>timestamp</td></tr>
 <tr><td>last_contact   </td><td>timestamp</td></tr>
 <tr><td>longitude      </td><td>float</td></tr>
 <tr><td>latitude       </td><td>float</td></tr>
 <tr><td>baro_altitude  </td><td>float</td></tr>
 <tr><td>on_ground      </td><td>bool</td></tr>
 <tr><td>velocity       </td><td>float</td></tr>
 <tr><td>heading        </td><td>float</td></tr>
 <tr><td>vertical_rate  </td><td>float</td></tr>
 <tr><td>sensors        </td><td>NoneType</td></tr>
 <tr><td>geo_altitude   </td><td>float</td></tr>
 <tr><td>squawk         </td><td>str</td></tr>
 <tr><td>spi            </td><td>bool</td></tr>
 <tr><td>position_source</td><td>int</td></tr>  
</table>
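
For reference, the structure could be mirrored in Python roughly as follows. This is a sketch, not the classes the OpenSky client actually returns: the field names follow the table and the JSON field list, while the choice of dataclasses, the `Optional` defaults and the use of integer Unix seconds for the timestamp fields are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class StateVector:
    icao24: str
    callsign: Optional[str] = None        # also used in the JSON field list
    origin_country: Optional[str] = None
    time_position: Optional[int] = None   # timestamp, assumed Unix seconds
    last_contact: Optional[int] = None    # timestamp, assumed Unix seconds
    longitude: Optional[float] = None
    latitude: Optional[float] = None
    baro_altitude: Optional[float] = None
    on_ground: Optional[bool] = None
    velocity: Optional[float] = None
    heading: Optional[float] = None
    vertical_rate: Optional[float] = None
    sensors: Optional[list] = None        # observed as NoneType in the table
    geo_altitude: Optional[float] = None
    squawk: Optional[str] = None
    spi: Optional[bool] = None
    position_source: Optional[int] = None


@dataclass
class OpenSkyStates:
    time: int                             # "long" in the description above
    states: List[StateVector] = field(default_factory=list)
```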
   

## Fields used in the JSON for the state_vector entity:
<table align="left">
  <tr><td>"time", T.TimestampType()</td></tr>
  <tr><td>"icao24", T.StringType()</td></tr>
  <tr><td>"callsign", T.StringType()</td></tr>
  <tr><td>"last_contact", T.TimestampType()</td></tr>
  <tr><td>"longitude", T.FloatType()</td></tr>
  <tr><td>"latitude", T.FloatType()</td></tr>
  <tr><td>"baro_altitude", T.FloatType()</td></tr>
  <tr><td>"on_ground", T.IntegerType()</td></tr>
  <tr><td>"velocity", T.FloatType()</td></tr>
  <tr><td>"geo_altitude", T.FloatType()</td></tr>
  <tr><td>"squawk", T.StringType()</td></tr>
  <tr><td>"position_source", T.IntegerType()</td></tr>
</table>
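
As an illustration of how one state vector might be reduced to these fields before it is serialized, here is a minimal sketch. The function name `to_state_vector_record` and the dict-based input are assumptions; only the field names and the integer type of `on_ground` come from the list above.

```python
import json

# Field names in the order listed above.
STATE_VECTOR_FIELDS = (
    "time", "icao24", "callsign", "last_contact", "longitude", "latitude",
    "baro_altitude", "on_ground", "velocity", "geo_altitude", "squawk",
    "position_source",
)


def to_state_vector_record(time, state):
    """Keep only the listed fields of `state` (a dict) and add `time`."""
    record = {"time": time}
    for name in STATE_VECTOR_FIELDS[1:]:
        record[name] = state.get(name)
    # on_ground is an IntegerType in the list above, so map True/False to 1/0.
    if record["on_ground"] is not None:
        record["on_ground"] = int(record["on_ground"])
    return record


# Fields that are not in the list (here: heading) are dropped.
example = to_state_vector_record(
    1600000000, {"icao24": "abc123", "on_ground": True, "heading": 90.0}
)
payload = json.dumps(example)
```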

## Hourly, Daily and Weekly tables:

    CREATE EXTERNAL TABLE opensky_network.states_last_hour
    (time TIMESTAMP, icao24 STRING, callsign STRING, last_contact TIMESTAMP,
    longitude FLOAT, latitude FLOAT, baro_altitude FLOAT, on_ground INT,
    velocity FLOAT, geo_altitude FLOAT, squawk STRING, position_source INT)
    PARTITIONED BY (date_minute STRING)
    STORED AS PARQUET
    LOCATION '/user/naya/FinalProject/last_hour';

    CREATE EXTERNAL TABLE opensky_network.states_last_day
    (time TIMESTAMP, icao24 STRING, callsign STRING, last_contact TIMESTAMP,
    longitude FLOAT, latitude FLOAT, baro_altitude FLOAT, on_ground INT,
    velocity FLOAT, geo_altitude FLOAT, squawk STRING, position_source INT)
    PARTITIONED BY (date_hour STRING)
    STORED AS PARQUET
    LOCATION '/user/naya/FinalProject/last_day';

    CREATE EXTERNAL TABLE opensky_network.states_last_week
    (time TIMESTAMP, icao24 STRING, callsign STRING, last_contact TIMESTAMP,
    longitude FLOAT, latitude FLOAT, baro_altitude FLOAT, on_ground INT,
    velocity FLOAT, geo_altitude FLOAT, squawk STRING, position_source INT)
    PARTITIONED BY (date_day STRING)
    STORED AS PARQUET
    LOCATION '/user/naya/FinalProject/last_week';
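
The three tables differ only in the granularity of their string partition column. One way the partition values could be derived from a state's time is sketched below. The column names come from the table definitions, but the string formats and the use of UTC are assumptions, since the original does not state them.

```python
from datetime import datetime, timezone

# Assumed formats; the actual partition value layout is not documented here.
PARTITION_FORMATS = {
    "date_minute": "%Y-%m-%d-%H-%M",  # states_last_hour
    "date_hour": "%Y-%m-%d-%H",       # states_last_day
    "date_day": "%Y-%m-%d",           # states_last_week
}


def partition_values(epoch_seconds):
    """Return the value of each partition column for a Unix timestamp."""
    ts = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return {column: ts.strftime(fmt) for column, fmt in PARTITION_FORMATS.items()}
```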

## HDFS settings
Check storage consumption:

    sudo -u hdfs hdfs dfs -du -h /FinalProject

Lower the replication level (to save space):

    sudo -u hdfs hdfs dfs -setrep -R 1 /FinalProject/Archive
    


## More interesting stuff

<ol>
    <li>Spark Structured Streaming and bool values:  <br>
        in Python a bool is True / False, while in the Spark schema used here the on_ground flag is stored as an integer (0 / 1) </li>
    <li>Saving raw JSON in HDFS takes more than 10 times the space of the same data saved in Parquet format </li>
    <li>Supporting Kafka messages larger than 1MB - see the section above on supporting large messages with Kafka</li>
    <li>It would probably be smarter to send each state vector as a separate message rather than all of them together in one list each cycle, because once the list has been turned into a dataframe it is harder to turn it back into an array of JSON objects and send the entire batch</li>
</ol>
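
The last point, one message per state vector, could look roughly like this. It is a sketch under assumptions: the function name `state_messages`, the dict-based input and the use of `icao24` as the message key are illustrative, not what the project currently does.

```python
import json


def state_messages(time, states):
    """Yield one (key, value) pair of bytes per state vector.

    `states` is an iterable of dicts that each contain at least "icao24".
    The snapshot time is copied into every message so that each one is
    self-contained.
    """
    for state in states:
        payload = dict(state, time=time)
        key = str(state["icao24"]).encode("utf-8")
        value = json.dumps(payload).encode("utf-8")
        yield key, value


# Each pair is small, so the 1MB default message limit is no longer an issue.
messages = list(state_messages(1600000000, [{"icao24": "abc123"}, {"icao24": "def456"}]))
```

Each pair can then be handed to a producer individually, for example with kafka-python's `KafkaProducer.send(topic, key=key, value=value)`, where the topic name is whatever the pipeline uses.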