Archive files are missing data #19

Closed
yuwenpeng opened this issue Feb 8, 2017 · 27 comments

@yuwenpeng

yuwenpeng commented Feb 8, 2017

Hi,

We found that some of our PMU data is not archived to the .d2(i) files. Some of the historical data can't be found through either the API or the 'Historian Data Viewer'. The missing data may span from 1 second to around 10 seconds and happens several times every hour.
We then added an action adapter to openHistorian to count how many values are received in every frame. It shows that all values and all frames are received by the adapter, which means openHistorian received the data, but some of the frames are not archived. We retrieved the data several hours after it was received by openHistorian, and some of it still can't be found.
In short, the data stream has 36,000 frames/hour and all frames are received by openHistorian, but around 400 frames/hour on average can't be found via the API or 'HistorianView.exe'.
What should we do to fix this issue? Could it be a configuration problem?
We are running openHistorian 2.0.415 on Windows Server 2012 with .NET Framework 4.5 and the openHistorian 2.1 release on Windows 10 with .NET Framework 4.6. Both are suffering from missing data.

Best wishes,
Wenpeng

@ritchiecarroll
Member

What version of the openHistorian are you using?

How is the data flowing into the openHistorian?

Is this PMU data?

Thanks,
Ritchie

@yuwenpeng
Author

@ritchiecarroll,
Thank you, Ritchie!
We are running openHistorian 2.0.415 on Windows Server 2012 with .NET Framework 4.5 and the openHistorian 2.1 release on Windows 10 with .NET Framework 4.6. Both are suffering from missing data.
It is FDR data using the FNET protocol. We have a self-developed server which receives data from the FDRs directly and then forwards it to openHistorian. The 'ConnectionString' of the devices in the openHistorian configuration database is 'transportprotocol=Tcp;......'. According to the action adapter, all data is received by openHistorian.

What version of the openHistorian are you using?
How is the data flowing into the openHistorian?
Is this PMU data?

@ritchiecarroll
Member

ritchiecarroll commented Feb 8, 2017

So you have developed a custom action adapter that runs inside the openHistorian that receives data from the FDRs? Or are you using the built-in FNET device protocol?

The only thing I can think of is that if the timestamps of the incoming data were duplicated, these points would be harder to extract, since the historian "key" is based on ID, then Timestamp, then a counter.
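For reference, the key is structured roughly like this (a simplified sketch, not the exact class definition):

```csharp
// Simplified sketch of the openHistorian key structure (illustrative only).
// Two measurements with the same PointID and Timestamp can only be told apart
// by the EntryNumber counter, which is why duplicated timestamps make points
// harder to extract.
public class HistorianKey
{
    public ulong Timestamp;    // measurement time, in ticks
    public ulong PointID;      // measurement (signal) identifier
    public ulong EntryNumber;  // counter for duplicate PointID + Timestamp pairs
}
```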

@yuwenpeng
Author

@ritchiecarroll
A custom action adapter runs inside openHistorian. The adapter receives data from openHistorian by implementing ConcentratorBase.PublishFrame(IFrame frame, int index).
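For reference, here is a minimal sketch of what the adapter does (simplified, with illustrative class and file names, not our exact code):

```csharp
// Simplified sketch of the counting adapter (illustrative only; the real
// adapter has more configuration). It derives from the GSF ActionAdapterBase,
// which is built on ConcentratorBase.
using System;
using System.IO;
using GSF.TimeSeries;
using GSF.TimeSeries.Adapters;

public class FrameCountAdapter : ActionAdapterBase
{
    public override bool SupportsTemporalProcessing => false;

    protected override void PublishFrame(IFrame frame, int index)
    {
        // frame.Timestamp is the concentrated frame time; frame.Measurements
        // holds the values received for that timestamp.
        DateTime timestamp = frame.Timestamp;  // GSF Ticks converts implicitly to DateTime
        File.AppendAllText("FrameCounts.txt",
            $"{timestamp:yyyy-MM-dd HH:mm:ss.fff}\t{frame.Measurements.Count}{Environment.NewLine}");
    }
}
```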

So you have developed a custom action adapter that runs inside the openHistorian that receives data from the FDRs? Or are you using the built-in FNET device protocol?

@yuwenpeng
Author

@ritchiecarroll
In the custom action adapter, I write the timestamp and value count of every IFrame to a text file. I didn't find any duplicated timestamps.

The only thing I can think of is that if the timestamps of the incoming data were duplicated - these points would be harder to extract since the historian "key" is based on ID, Timestamp then a counter.

@ritchiecarroll
Member

ritchiecarroll commented Feb 8, 2017

I am assuming the actual F-NET data comes in over a standard device connection using the built-in UTK FNET protocol?

If so, the steps below will help us see if the outage is due to a reconnect - it could be that there is a delay in the source data stream and the device simply reconnects, a standard operation when data stops flowing. By default this timeout is set to 5 seconds, which may be too short, so the device keeps reconnecting. By the way, this setting is called "Data Loss Interval" and can be found on the openHistorian Manager device configuration screen.

From the openHistorian machine connected to the remote FNET streams, run the “Statistics Trending Tool” – as named in the Start menu; the actual EXE name is StatisticView.exe in the openHistorian installation folder. Once running, connect to the Statistics historian – note that the application should already default to this archive when no other local archive is installed, but just in case, the default statistics archive path is “C:\Program Files\openHistorian\Statistics\” – this assumes openHistorian was installed on the C: drive.

Now, for the connected FNET device, find the statistic that ends with “!IS:ST8”, i.e., input stream statistic number 8, which is a boolean value representing whether the input stream was continually connected during the last reporting interval. For statistics the default reporting interval is every 10 seconds, so if the stat is non-zero the device was continually connected over that 10-second period.

If you select this statistic and trend this value over the time window for the data gap you found, you will then know if the device connection was interrupted and hence caused the data gap.

If the device connection was interrupted, you can go through the logs (using the “Log File Viewer” application) and check the messages for the device around the same timeframe – from this you can get a clue whether the connection was terminated by the remote device or reset because data stopped flowing, which is often due to a change in the network, e.g., a router reboot.

If this is not the case then there may be something else going on and I would suggest carefully monitoring the logs for errors or connection issues.

Thanks,
Ritchie

@yuwenpeng
Author

@ritchiecarroll
Thank you for your kind help!
All the devices use the built-in UTK FNET protocol.
One of the time windows of missing data is 02/08/2017 0:02:39 - 0:02:42. I picked four devices which have missing data, as shown below. It seems all of them were continually connected in that time window.
[screenshot]
Actually, almost all devices (around 100) lost data in that time window, and we didn't find any connection status change. The status was updated at 0:02:26, 36, 46, 56... And the action adapter received all the data.
[screenshot]
[screenshot]

@ritchiecarroll
Member

We may want you to install a newer build so that we can have enhanced logging turned on and evaluate what happens when a gap occurs. Would this be possible? If so, you can update the existing version with a nightly build:

https://www.gridprotectionalliance.org/NightlyBuilds/openHistorian/Beta/openHistorian.Installs.zip

With the new version we have a new detailed logging system that will provide much more detail. Also, the new archive log files are time-stamped so that we can find overlapping logs around the time of a detected data gap. After installation the log files will end up in the "C:\Program Files\openHistorian\Logs" folder.

Thanks,
Ritchie

@ritchiecarroll
Member

Also - what about data quality? Are we sure received data is not NaN? Are you writing captured data from your custom adapter to a file that we could review?

Thanks,
Ritchie

@yuwenpeng
Author

yuwenpeng commented Feb 8, 2017

@ritchiecarroll
That's great! Let me install this newer build to get detailed log information.

Also - what about data quality? Are we sure received data is not NaN? Are you writing captured data from your custom adapter to a file that we could review?

Right now the adapter doesn't write the data to files; only the timestamp and value count are captured. Since we have another self-developed server receiving the same forwarded data without any problems, the data quality should be good. To be 100% sure, I'll write all the data to files after installing the aforementioned build.
Thank you very much!
Wenpeng

@yuwenpeng
Author

@ritchiecarroll
With the new build, I got another time window of missing data [UTC 2/8/2017 23:18:20.5, 23:18:22.7].
[screenshot]

I didn't find any connection status change.
[screenshot]

The custom action adapter received all the data, and the data quality is good.
[screenshot]
The value counts (the last column) received by the adapter didn't change significantly over several hours, staying around 80 to 88.
[screenshot]

It seems all the data is received by openHistorian and the data quality is good, but something stopped openHistorian from archiving some of the time frames.

@ritchiecarroll
Member

OK, thanks. Let's get a copy of the log files around that time frame and see what was going on.

@yuwenpeng
Author

yuwenpeng commented Feb 9, 2017

@ritchiecarroll
Here is the local time (column 0) and UTC time (column 1).
[screenshot]
This is a screenshot of the Log File Viewer. Usually there are at least 5 log messages every 10 seconds, but there aren't any logs from 18:18:18 - 18:18:28. The log file is also attached.

[screenshot]

This is the log file. Please change the extension back to .logz; I'm not allowed to upload .logz or .zip files...
20170208-225631-007026 - openHistorian - 1.pdf

Are there any other log files which may have useful information?

@StevenChisholm
Contributor

StevenChisholm commented Feb 9, 2017 via email

@ritchiecarroll
Member

ritchiecarroll commented Feb 9, 2017

Looks like there is a pause, maybe Garbage Collection related, and we will want to validate the GC settings in your config file (i.e., openHistorian.exe.config).

Regardless, the device simply reconnects when no data is received from the source data stream during the pause - again, a standard operation when data stops flowing. By default this timeout is set to 5 seconds, which must be too short, so the device is reconnecting. This setting is called "Data Loss Interval" and can be found on the openHistorian Manager device configuration screen.

Suggest changing this to 15 seconds for each input device to see if data loss stops.

Thanks,
Ritchie

@yuwenpeng
Author

yuwenpeng commented Feb 9, 2017

I added EnableThreadPoolMonitoring=; and set 'Data loss interval' to 15 by modifying the 'Device' table in the configuration database directly, and openHistorian has been restarted. Hopefully it helps.
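For reference, the edit to openHistorian.exe.config looks roughly like this (illustrative fragment only; other parameters and attributes already present on that setting are omitted here):

```xml
<!-- Illustrative sketch: the OptimizationsConnectionString setting with
     EnableThreadPoolMonitoring=; appended to its value. Existing attributes
     and other parameters in the value should be left as they are. -->
<add name="OptimizationsConnectionString" value="EnableThreadPoolMonitoring=;" />
```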

To verify this is the root cause, turn on ThreadPoolMonitoring and restart openHistorian. You can do this by modifying your openhistorian.exe.config file. Find the line that says OptimizationsConnectionString and add EnableThreadPoolMonitoring=;

Suggest changing this to 15 seconds for each input device to see if data loss stops.

I still have some questions. According to the custom action adapter and the statistics database, all the data is received by openHistorian and all devices are continually connected. Also, if there were a pause, we should observe an obvious change in the time difference between the timestamp (UTC) and the log time, which is the local time when the custom adapter received the time frame. But according to the log files the time difference stayed at 2.3 seconds and didn't change.

@ritchiecarroll
Member

I agree - that's what is strange about all this. Also, there are some very unusual messages in your error log. Perhaps the next best step is to have a WebEx and take a look?

Want to send me an e-mail about scheduling this?

Thanks,
Ritchie

@yuwenpeng
Author

yuwenpeng commented Feb 9, 2017

@ritchiecarroll
I modified the configuration as you and Steven Chisholm suggested, and openHistorian has run for several hours. There is still missing data.
Thank you for taking the time to help us! I'm flexible for a WebEx. When are you available?

@ritchiecarroll
Member

ritchiecarroll commented Feb 9, 2017

Tomorrow afternoon? Please send me an e-mail at my GPA e-mail address and we will arrange this...

@yuwenpeng
Author

yuwenpeng commented Feb 9, 2017

@ritchiecarroll @StevenChisholm
I'm digging into the source code and traced it down to GSF.Snap.Collection.SortedPointBuffer<TKey,TValue>.TryEnqueue(TKey, TValue) for the PrebufferWriter.
It seems the SortedPointBuffer received all the data. I need to go deeper to find where the data gets lost...

@yuwenpeng
Author

yuwenpeng commented Feb 11, 2017

@ritchiecarroll @StevenChisholm
It seems data may sometimes get lost when creating a UnionTreeStream<TKey, TValue>.
UnionTreeStream<TKey, TValue> is used to merge a PendingTable to the next level or to roll over pending tables from all levels to the NextStage archive. But sometimes the merging procedure stops unexpectedly and the remaining data gets lost. One example is shown below:
10 PendingTable2 tables are merged into a PendingTable3. There are 53,021 values in the 10 PendingTable2 tables, but only 15,674 of them are copied to the PendingTable3. This could also happen in the rollover from the FirstStage to the NextStage archive.
I still don't know what stops the merging procedure.
[screenshot]

@yuwenpeng
Author

@ritchiecarroll @StevenChisholm
Duplicated data may be the reason, but I'm not sure yet...

@yuwenpeng
Author

yuwenpeng commented Feb 12, 2017

@StevenChisholm
@ritchiecarroll
I think I found the bug. It is in UnionTreeStreamSortHelper.SortAssumingIncreased(int index), at line 116, as shown below.
[screenshot]
x should not start from 0 but from index. When there are duplicated keys in the BufferedTreeStreams, and the duplicated key happens to be the last key of the second (or third, and so on) BufferedTreeStream, this sort function is called and makes a mistake.
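Roughly, the logic looks like this (an illustrative sketch only, not the actual GSF source; the helper names here are made up):

```csharp
// Illustrative sketch only -- not the actual GSF source. After the stream at
// position 'index' advances to a larger cache key, the helper is supposed to
// bubble that stream toward the end of the array to restore sort order.
void SortAssumingIncreased(int index)
{
    // If the loop starts at x = 0 instead of x = index, an early break on an
    // already-ordered pair near the front can stop the scan before the stream
    // at 'index' is ever compared, leaving the later streams out of order so
    // they end up being skipped.
    for (int x = index; x < streamCount - 1; x++)
    {
        if (CompareCacheKeys(x, x + 1) <= 0)
            break;              // already in order from here on

        SwapStreams(x, x + 1);  // move the advanced stream one slot forward
    }
}
```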

One of the captured examples is shown below, in which 2017-02-12 00:37:22.3000000/785 is the duplicated key.
There are 3 sections:

  1. The original order of the BufferedTreeStreams with their cache keys and valid flags, before removing the duplicated key.
  2. The streams after removing the duplicated key by advancing the position of the second duplicate.
  3. The information and order of the streams after sorting.
    This caused the 3 remaining streams to be ignored.
    [screenshot]

Then the data gets lost.
[screenshot]

@StevenChisholm
Contributor

StevenChisholm commented Feb 12, 2017 via email

@yuwenpeng
Author

yuwenpeng commented Feb 12, 2017

The problem was solved.
Sometimes a meter may resend data due to network issues; that's why we get duplicated keys.
Previously the data loss happened around 7 times per hour on average. Now there hasn't been any missing data in the last 15 hours.
Thank you for all your help! @ritchiecarroll @StevenChisholm

Best wishes,
Wenpeng Yu

@StevenChisholm
Contributor

StevenChisholm commented Feb 12, 2017 via email

@ritchiecarroll
Member

I'll add my thank you as well - nice job @yuwenpeng! I'll close this for now given the discovered fix:
437ed9d

Thanks again!
Ritchie
