Archive files are missing data #19

Closed
yuwenpeng opened this issue Feb 8, 2017 · 27 comments

@yuwenpeng

yuwenpeng commented Feb 8, 2017

Hi,

We found that some of our PMU data is not archived to the .d2(i) files. Some of the historical data can't be found through either the API or the 'Historian Data Viewer'. The missing data may span from 1 second to around 10 seconds and happens several times every hour.
We then added an action adapter to openHistorian to count how many values are received in every frame. It shows that all values and all frames are received by the adapter, which means openHistorian received the data, but some of the frames are not archived. We retrieved the data several hours after it was received by openHistorian, and some of it still can't be found.
In short, the data stream has 36,000 frames/hour and all frames are received by openHistorian, but around 400 frames/hour on average can't be found via the API or 'HistorianView.exe'.
What should we do to fix this issue? Could it be a configuration problem?
We are running openHistorian 2.0.415 on Windows Server 2012 with .NET Framework 4.5 and the openHistorian 2.1 release on Windows 10 with .NET Framework 4.6. Both are suffering from missing data.

Best wishes,
Wenpeng

@ritchiecarroll
Member

What version of the openHistorian are you using?

How is the data flowing into the openHistorian?

Is this PMU data?

Thanks,
Ritchie

@yuwenpeng
Author

@ritchiecarroll,
Thank you, Ritchie!
We are running openHistorian 2.0.415 on Windows Server 2012 with .NET Framework 4.5 and the openHistorian 2.1 release on Windows 10 with .NET Framework 4.6. Both are suffering from missing data.
It is FDR data using the FNET protocol. We have a self-developed server which receives data from the FDRs directly and then forwards it to openHistorian. The 'ConnectionString' of the devices in the openHistorian configuration database is 'transportprotocol=Tcp;......'. According to the action adapter, all data is received by openHistorian.

What version of the openHistorian are you using?
How is the data flowing into the openHistorian?
Is this PMU data?

@ritchiecarroll
Member

ritchiecarroll commented Feb 8, 2017

So you have developed a custom action adapter that runs inside the openHistorian that receives data from the FDRs? Or are you using the built-in FNET device protocol?

The only thing I can think of is that if the timestamps of the incoming data were duplicated, these points would be harder to extract, since the historian "key" is based on ID, then Timestamp, then a counter.
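For reference, the key is structured roughly like this (a simplified sketch, not the exact class definition):

```csharp
// Simplified sketch of the openHistorian key structure (illustrative only).
// Two measurements with the same PointID and Timestamp can only be told apart
// by the EntryNumber counter, which is why duplicated timestamps make points
// harder to extract.
public class HistorianKey
{
    public ulong Timestamp;    // measurement time, in ticks
    public ulong PointID;      // measurement (signal) identifier
    public ulong EntryNumber;  // counter for duplicate PointID + Timestamp pairs
}
```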

@yuwenpeng
Author

@ritchiecarroll
A custom action adapter runs inside openHistorian. The adapter receives data from openHistorian by implementing ConcentratorBase.PublishFrame(IFrame frame, int index).
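For reference, here is a minimal sketch of what the adapter does (simplified, with illustrative class and file names, not our exact code):

```csharp
// Simplified sketch of the counting adapter (illustrative only; the real
// adapter has more configuration). It derives from the GSF ActionAdapterBase,
// which is built on ConcentratorBase.
using System;
using System.IO;
using GSF.TimeSeries;
using GSF.TimeSeries.Adapters;

public class FrameCountAdapter : ActionAdapterBase
{
    public override bool SupportsTemporalProcessing => false;

    protected override void PublishFrame(IFrame frame, int index)
    {
        // frame.Timestamp is the concentrated frame time; frame.Measurements
        // holds the values received for that timestamp.
        DateTime timestamp = frame.Timestamp;  // GSF Ticks converts implicitly to DateTime
        File.AppendAllText("FrameCounts.txt",
            $"{timestamp:yyyy-MM-dd HH:mm:ss.fff}\t{frame.Measurements.Count}{Environment.NewLine}");
    }
}
```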

So you have developed a custom action adapter that runs inside the openHistorian that receives data from the FDRs? Or are you using the built-in FNET device protocol?

@yuwenpeng
Author

@ritchiecarroll
In the custom action adapter, I write the timestamp and value count of every IFrame to a text file. I didn't find any duplicated timestamps.

The only thing I can think of is that if the timestamps of the incoming data were duplicated - these points would be harder to extract since the historian "key" is based on ID, Timestamp then a counter.

@ritchiecarroll
Member

ritchiecarroll commented Feb 8, 2017

I am assuming the actual F-NET data comes in over a standard device connection using the built-in UTK FNET protocol?

If so, the steps below will help us see if the outage is due to a reconnect - it could be that there is a delay in the source data stream and the device simply reconnects, a standard operation when data stops flowing. By default this timeout is set to 5 seconds, which may be too short, so the device keeps reconnecting. By the way, this setting is called "Data Loss Interval" and can be found on the openHistorian Manager device configuration screen.

From the openHistorian machine connected to the remote FNET streams, run the “Statistics Trending Tool” – as named in the Start menu; the actual EXE name is StatisticView.exe in the openHistorian installation folder. Once running, connect to the Statistics historian – note that the application should already default to this archive when no other local archive is installed, but just in case, the default statistics archive path is “C:\Program Files\openHistorian\Statistics\” – this assumes openHistorian was installed on the C: drive.

Now, for the connected FNET device, find the statistic that ends with “!IS:ST8”, i.e., input stream statistic number 8, which is a boolean value representing whether the input stream was continually connected during the last reporting interval. For statistics the default reporting interval is every 10 seconds, so if the stat is non-zero the device was continually connected over that 10-second period.

If you select this statistic and trend this value over the time window for the data gap you found, you will then know if the device connection was interrupted and hence caused the data gap.

If the device connection was interrupted, you can go through the logs (using the “Log File Viewer” application) and check the messages for the device around the same timeframe – from this you can get a clue whether the connection was terminated by the remote device or reset because data stopped flowing, which is often due to a change in the network, e.g., a router reboot.

If this is not the case then there may be something else going on and I would suggest carefully monitoring the logs for errors or connection issues.

Thanks,
Ritchie

@yuwenpeng
Author

@ritchiecarroll
Thank you for your kind help!
All the devices use the built-in UTK FNET protocol.
One of the time windows of missing data is 02/08/2017 0:02:39 - 0:02:42. I picked four devices which have missing data, as shown below. It seems all of them were continually connected in that time window.
[screenshot]
Actually, almost all devices (around 100) lost data in that time window, and we didn't find any connection status change. The status was updated at 0:02:26, 36, 46, 56... And the action adapter received all the data.
[screenshot]
[screenshot]

@ritchiecarroll
Member

We may want you to install a newer build so that we can have enhanced logging turned on and evaluate what happens when a gap occurs. Would this be possible? If so, you can update the existing version with a nightly build:

https://www.gridprotectionalliance.org/NightlyBuilds/openHistorian/Beta/openHistorian.Installs.zip

With the new version we have a new detailed logging system that will provide much more detail. Also, the new archive log files are time-stamped so that we can find overlapping logs around the time of a detected data gap. After installation the log files will end up in the "C:\Program Files\openHistorian\Logs" folder.

Thanks,
Ritchie

@ritchiecarroll
Member

Also - what about data quality? Are we sure received data is not NaN? Are you writing captured data from your custom adapter to a file that we could review?

Thanks,
Ritchie

@yuwenpeng
Author

yuwenpeng commented Feb 8, 2017

@ritchiecarroll
That's great! Let me install this newer build to get detailed log information.

Also - what about data quality? Are we sure received data is not NaN? Are you writing captured data from your custom adapter to a file that we could review?

Right now the adapter doesn't write the data to files; only the timestamp and value count are captured. Since we have another self-developed server receiving the same forwarded data without any problems, the data quality should be good. To be 100% sure, I'll write all the data to files after installing the aforementioned build.
Thank you very much!
Wenpeng

@yuwenpeng
Author

@ritchiecarroll
With the new build, I got another time window of missing data [UTC 2/8/2017 23:18:20.5, 23:18:22.7].
[screenshot]

I didn't find any connection status change.
[screenshot]

The custom action adapter received all the data, and the data quality is good.
[screenshot]
The value counts (the last column) received by the adapter didn't change significantly over several hours, staying around 80 to 88.
[screenshot]

It seems all the data is received by openHistorian and the data quality is good, but something stopped openHistorian from archiving some of the time frames.

@ritchiecarroll
Member

OK, thanks. Let's get a copy of the log files around that time frame and see what was going on.

@yuwenpeng
Author

yuwenpeng commented Feb 9, 2017

@ritchiecarroll
Here is the local time (column 0) and UTC time (column 1).
[screenshot]
This is a screenshot of the Log File Viewer. Usually there are at least 5 log messages every 10 seconds, but there aren't any logs from 18:18:18 - 18:18:28. The log file is also attached.

[screenshot]

This is the log file. Please change the extension back to .logz; I'm not allowed to upload .logz or .zip files...
20170208-225631-007026 - openHistorian - 1.pdf

Are there any other log files which may have useful information?

@StevenChisholm
Contributor

StevenChisholm commented Feb 9, 2017 via email

@ritchiecarroll
Member

ritchiecarroll commented Feb 9, 2017

Looks like there is a pause, maybe Garbage Collection related, and we will want to validate the GC settings in your config file (i.e., openHistorian.exe.config).

Regardless, the device simply reconnects when no data is received from the source data stream during the pause - again, a standard operation when data stops flowing. By default this timeout is set to 5 seconds, which must be too short, so the device is reconnecting. This setting is called "Data Loss Interval" and can be found on the openHistorian Manager device configuration screen.

Suggest changing this to 15 seconds for each input device to see if data loss stops.

Thanks,
Ritchie

@yuwenpeng
Author

yuwenpeng commented Feb 9, 2017

I added EnableThreadPoolMonitoring=; and set 'Data loss interval' to 15 by modifying the 'Device' table in the configuration database directly, and openHistorian has been restarted. Hopefully it helps.
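For reference, the edit to openHistorian.exe.config looks roughly like this (illustrative fragment only; other parameters and attributes already present on that setting are omitted here):

```xml
<!-- Illustrative sketch: the OptimizationsConnectionString setting with
     EnableThreadPoolMonitoring=; appended to its value. Existing attributes
     and other parameters in the value should be left as they are. -->
<add name="OptimizationsConnectionString" value="EnableThreadPoolMonitoring=;" />
```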

To verify this is the root cause, turn on ThreadPoolMonitoring and restart openHistorian. You can do this by modifying your openhistorian.exe.config file. Find the line that says OptimizationsConnectionString and add EnableThreadPoolMonitoring=;

Suggest changing this to 15 seconds for each input device to see if data loss stops.

I still have some questions. According to the custom action adapter and the statistics database, all the data is received by openHistorian and all devices are continually connected. Also, if there were a pause, we should observe an obvious change in the time difference between the timestamp (UTC) and the log time, which is the local time when the custom adapter received the time frame. But according to the log files the time difference stayed at 2.3 seconds and didn't change.

@ritchiecarroll
Member

I agree - that's what is strange about all this. Also, there are some very unusual messages in your error log. Perhaps the next best step is to have a WebEx and take a look?

Want to send me an e-mail about scheduling this?

Thanks,
Ritchie

@yuwenpeng
Author

yuwenpeng commented Feb 9, 2017

@ritchiecarroll
I modified the configuration as you and Steven Chisholm suggested, and openHistorian has run for several hours. There is still missing data.
Thank you for taking the time to help us! I'm flexible for a WebEx. When are you available?

@ritchiecarroll
Member

ritchiecarroll commented Feb 9, 2017

Tomorrow afternoon? Please send me an e-mail at my GPA e-mail address and we will arrange this...

@yuwenpeng
Author

yuwenpeng commented Feb 9, 2017

@ritchiecarroll @StevenChisholm
I'm digging into the source code and traced it down to GSF.Snap.Collection.SortedPointBuffer<TKey,TValue>.TryEnqueue(TKey, TValue) for the PrebufferWriter.
It seems the SortedPointBuffer received all the data. I need to go deeper to find where the data gets lost...

@yuwenpeng
Author

yuwenpeng commented Feb 11, 2017

@ritchiecarroll @StevenChisholm
It seems data may sometimes get lost when creating a UnionTreeStream<TKey, TValue>.
UnionTreeStream<TKey, TValue> is used to merge a PendingTable to the next level or to roll over pending tables from all levels to the NextStage archive. But sometimes the merging procedure stops unexpectedly and the remaining data gets lost. One example is shown below:
10 PendingTable2 tables are merged into a PendingTable3. There are 53,021 values in the 10 PendingTable2 tables, but only 15,674 of them are copied to the PendingTable3. This could also happen in the rollover from the FirstStage to the NextStage archive.
I still don't know what stops the merging procedure.
[screenshot]

@yuwenpeng
Author

@ritchiecarroll @StevenChisholm
Duplicated data may be the reason, but I'm not sure yet...

@yuwenpeng
Author

yuwenpeng commented Feb 12, 2017

@StevenChisholm
@ritchiecarroll
I think I found the bug. It is in UnionTreeStreamSortHelper.SortAssumingIncreased(int index), at line 116, as shown below.
[screenshot]
x should not start from 0 but from index. When there are duplicated keys in the BufferedTreeStreams, and the duplicated key happens to be the last key of the second (or third, and so on) BufferedTreeStream, this sort function is called and makes a mistake.
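Roughly, the logic looks like this (an illustrative sketch only, not the actual GSF source; the helper names here are made up):

```csharp
// Illustrative sketch only -- not the actual GSF source. After the stream at
// position 'index' advances to a larger cache key, the helper is supposed to
// bubble that stream toward the end of the array to restore sort order.
void SortAssumingIncreased(int index)
{
    // If the loop starts at x = 0 instead of x = index, an early break on an
    // already-ordered pair near the front can stop the scan before the stream
    // at 'index' is ever compared, leaving the later streams out of order so
    // they end up being skipped.
    for (int x = index; x < streamCount - 1; x++)
    {
        if (CompareCacheKeys(x, x + 1) <= 0)
            break;              // already in order from here on

        SwapStreams(x, x + 1);  // move the advanced stream one slot forward
    }
}
```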

One of the captured examples is shown below, in which 2017-02-12 00:37:22.3000000/785 is the duplicated key.
There are 3 sections:

  1. The original order of the BufferedTreeStreams with their cache keys and valid flags, before removing the duplicated key.
  2. The streams after removing the duplicated key by advancing the position of the second duplicate.
  3. The information and order of the streams after sorting.
    This caused the 3 remaining streams to be ignored.
    [screenshot]

Then the data gets lost.
[screenshot]

@StevenChisholm
Contributor

StevenChisholm commented Feb 12, 2017 via email

@yuwenpeng
Author

yuwenpeng commented Feb 12, 2017

The problem was solved.
Sometimes a meter may resend data due to network issues; that's why we get duplicated keys.
Previously the data loss happened around 7 times per hour on average. Now there hasn't been any missing data in the last 15 hours.
Thank you for all your help! @ritchiecarroll @StevenChisholm

Best wishes,
Wenpeng Yu

@StevenChisholm
Contributor

StevenChisholm commented Feb 12, 2017 via email

@ritchiecarroll
Member

I'll add my thank you as well - nice job @yuwenpeng! I'll close this for now given the discovered fix:
437ed9d

Thanks again!
Ritchie
